US20020152078A1

US20020152078A1 - Voiceprint identification system

Info

Publication number: US20020152078A1
Application number: US10/046,824
Authority: US
Inventors: Matt Yuschik; Robert Slezak
Original assignee: Individual
Current assignee: Individual
Priority date: 1999-10-25
Filing date: 2002-01-17
Publication date: 2002-10-17
Also published as: US6356868B1

Abstract

A voiceprint identification system identifies and verifies a user from voice data collected from a single interaction with the user. The voice data is a number, word phrase or any utterance chosen by the user. A first speech-processor processes the voice data to produce first match criteria, and a second speech-processor processes the same voice data to produce second match criteria, the first match criteria being different than the second match criteria. The first match criteria is used to select a subset of authorized persons, and, for each selected authorized person, the authorized person's voice template is retrieved from a database. The retrieved voice templates are individually compared to the second match criteria until either the second match criteria matches one of the retrieved voice templates or all the retrieved voice templates have been compared without matching the second match criteria.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is a divisional of U.S. application Ser. No. 09/422,851, filed Oct. 25, 1999, now pending.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is directed to access control systems, and, more particularly, to access control based on voiceprint identification.

2. Description of the Related Art

Access control systems are used to prevent unauthorized users from gaining access to protected resources, such as computers, buildings, automatic teller machines (ATMs), credit cards and voicemail systems. When a user attempts to access a protected resource, a typical access control system engages in one or more interactions with the user, such as prompting the user and requiring him to enter an identity of an authorized person (the user's purported identity) and a valid passcode (sometimes called a personal identification number or PIN). For example, a typical voicemail system requires a user to first enter his mailbox number and then a passcode by pressing keys on his telephone. Only if the entered passcode matches the passcode associated with the entered mailbox number is the user deemed to be a subscriber and allowed to further interact with the system, i.e. to access a restricted resource, a mailbox in this case, and to retrieve messages or send messages to other subscribers.

The advantages of using a voiceprint system over standard PIN systems are several. First, it is quicker and more convenient to speak than to punch codes into a numeric keypad. Also, if the user is required to enter his PIN by pressing telephone keys, this is impossible if he is not using a touch-tone phone or is using a phone that does not allow tone codes, such as a cellular or cordless phone. Finally, a voiceprint system is more secure than a standard PIN system: even if an impostor obtains a subscriber's passcode, the impostor's voiceprint will not match the subscriber's, so the impostor will not gain access.

Voiceprint access control systems use discriminating characteristics of each authorized person's voice to ascertain whether a user is authorized to access a protected resource. The sound of a speaker's voice is influenced by, among other things, the speaker's physical characteristics, including such articulators as the tongue, vocal tract size, and speech production manner, such as place and rate of articulation. When a user attempts to access a protected resource, a typical voiceprint system samples an utterance produced by the user and then compares the voiceprint of the utterance to a previously stored voiceprint of the authorized person, whom the user purports to be.

Voiceprint systems must be trained to recognize and differentiate each authorized person through her voice. This training involves sampling each authorized person's voice while she utters a predetermined word or phrase and then processing this speech sample to calculate a set of numeric parameters (commonly called the acoustic features in a “voice template” of the speaker's voice). This voice template is stored, along with other voice templates, in a database that is indexed (sometimes known as keyed) by the identity of the speaker.

The parameters of a voice template quantify certain biometric characteristics of the speaker's voice, such as amplitude, frequency spectrum and timing, while the speaker utters the predetermined word or phrase. A speaker's voice template is fairly unique, although not as unique as some other characteristics of the speaker, such as the speaker's fingerprint. For example, identical twins are likely to have nearly indistinguishable voices, because their vocal tracts are similarly shaped.

When a user attempts to gain access to a protected resource, the user enters his purported identity, and then a conventional voiceprint identification system uses this identity to index into the database and retrieves a single voice template, namely the voice template of the authorized person who the user purports to be. The system prompts the user to speak a predetermined word or phrase and samples the user's voice to create a voice template from the user's utterance. The system then compares the user's voice template to the authorized person's voice template using one or more well-known statistical decision-theoretic techniques. This comparison produces a binary (match/no match) result. If the two voice templates are sufficiently similar, the voice templates are said to match (as that term is used hereinafter) and the user is deemed to be the authorized person, otherwise the voice templates are said to not match and the user is deemed to be an impostor.

The statistical decision associated with hypothesis testing (match/no-match) is characterized by two types of errors: false rejection (Type I errors) and false acceptance (Type II errors). The algorithms used in the comparison are typically adjusted so that the likelihood of Type I errors is approximately equal to the likelihood of Type II errors.
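The trade-off between the two error types can be illustrated with a minimal sketch. It assumes that each comparison yields a numeric distance and that a match is declared when the distance is at or below a threshold; the function names and the sample distances are hypothetical and are not taken from this disclosure.

```python
def error_rates(genuine_distances, impostor_distances, threshold):
    """Return (type_i_rate, type_ii_rate) for a given distance threshold.

    Type I  (false rejection): a genuine speaker scores above the threshold.
    Type II (false acceptance): an impostor scores at or below the threshold.
    """
    false_rejects = sum(1 for d in genuine_distances if d > threshold)
    false_accepts = sum(1 for d in impostor_distances if d <= threshold)
    return (false_rejects / len(genuine_distances),
            false_accepts / len(impostor_distances))


def equal_error_threshold(genuine_distances, impostor_distances):
    """Pick the observed distance at which the two error rates are closest."""
    def gap(threshold):
        type_i, type_ii = error_rates(genuine_distances, impostor_distances, threshold)
        return abs(type_i - type_ii)

    candidates = sorted(set(genuine_distances) | set(impostor_distances))
    return min(candidates, key=gap)


# Hypothetical distances from genuine and impostor trials.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]
```

Raising the threshold in this sketch lowers the Type I rate and raises the Type II rate, and lowering it does the opposite.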

There are two kinds of speech recognition technology packages. The first is Speaker-Independent (SI) speech recognition technology, which can recognize words and does not require training by the individual user. The disadvantage of SI technology is the active vocabulary of words it can generally recognize is limited to reduce errors and calculation time. The second type of speech recognition technology is Speaker-Dependent (SD) technology, which requires training of each word by each individual user but has significantly higher accuracy for the user.

Speaker Independent recognition focuses on common acoustic features of a sound, and attempts to match many instantiations of an utterance with one, common “prototype” of that utterance (many-to-one mapping). Speaker Dependent recognition focuses on acoustically differentiating the (possibly different) features so that one pattern can be selected from many similar patterns (one-from-many mapping). This is the “subscriber”, as described in this invention.

Prior art for voiceprint identification focuses on SD technology and is limited to confidently recognizing the unique acoustic pattern of the subscriber (one-from-one mapping). The invention permits more flexibility by combining the two ASR technologies so that SI ASR is used to identify a subset of subscribers (a cohort), and SD ASR is used to verify a particular member of the cohort (the subscriber).

As commonly used in the art, “identification” means ascertaining a user's purported identity, and “verification” means ascertaining if the user's voice matches the voice of the specified, e.g. identified, speaker.

Some prior art voiceprint identification systems assign a unique spoken passcode to each user, such as a random number or the user's social security number. Because each user has his own passcode, the voiceprint identification system can readily access the user's account once it identifies the user from his passcode.

Requiring each authorized person's password to be unique poses problems. For example, authorized persons cannot readily choose or change their passwords. Additionally, using preassigned numbers, such as a user's social security or telephone number, may pose security problems.

What is needed, therefore, is a voiceprint identification system that identifies and verifies a user from a single utterance, but that permits multiple authorized persons to have identical passwords. A system that simply derives a voice template of the user's voice sample and then exhaustively searches an entire database for a matching voice template would be slow, because this database stores a large quantity of data, associated with each and every authorized person. Furthermore, such a system would produce an unacceptably high rate of Type I or Type II errors. As the number of valid templates increases, the user's voice template is increasingly likely to be closer to one of the valid voice templates. Adjusting the comparison algorithms to reduce the likelihood of Type II (false acceptance) errors would raise the likelihood of Type I (false rejection) errors to an unacceptably high value.
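The growth of false acceptances with database size can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each template comparison falsely accepts an impostor independently with the same probability; neither the independence assumption nor the example rate comes from this disclosure.

```python
def prob_any_false_accept(per_template_rate, n_templates):
    """Probability that an impostor matches at least one of n templates,
    assuming independent comparisons with the same false-acceptance rate."""
    return 1.0 - (1.0 - per_template_rate) ** n_templates


# With a hypothetical 1% per-comparison rate, compare a small cohort
# with an exhaustive search over a large database.
small_cohort = prob_any_false_accept(0.01, 10)
whole_database = prob_any_false_accept(0.01, 1000)
```

Under these assumptions, restricting the comparison to a small subset of templates keeps the overall false-acceptance probability near the per-comparison rate, while an exhaustive search drives it toward certainty.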

Objectives of the invention include implementing more secure, efficient and user friendly voice identification systems.

The above objectives can be attained by a system that identifies and verifies a user from voice data collected from a single utterance by the user. At least two signal processors process the voice data, and each signal processor operates with a different selection criterion. These selection criteria are used together to select at most one matching record of an authorized person from a database of authorized persons. Each individual selection criterion optimally partitions the database into two subsets of records: a subset of selected records and a subset of non-selected records. The matching record is defined as the intersection of the subsets of selected records. If a single matching record is selected, the user is deemed to be identified and verified as the authorized person who corresponds to the matching record. On the other hand, if no matching record is selected, the user is deemed not to be an authorized person.

These together with other objectives and advantages, which will be subsequently apparent, reside in the details of construction and operation as more fully hereinafter described and claimed, reference being had to the accompanying drawings forming a part hereof.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] FIG. 1 is a block diagram of an access control system according to the present invention.
[0023] FIG. 2 is a diagram of a subscriber record stored in the database of FIG. 1.
[0024] FIG. 3 is a flowchart of the serial embodiment of the process of identifying a user from his chosen passcode.
[0025] FIG. 4 is a flowchart of the parallel embodiment of the process of identifying a user from his chosen passcode.
[0026] FIG. 5 is a flowchart of the optimized comparison step.
[0027] FIG. 6 is a flowchart of the process of initializing a new user.
[0028] FIG. 7 is a flowchart of the process of leaving a message using the caller's cohort option.
[0029] FIG. 8 is a flowchart of the process of retrieving messages from a desired caller using the caller's cohort option.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0030] By analogy, the invention selects a record matching a caller in a manner similar to selecting a playing card from a stack of cards by identifying a card suit and verifying a card number as selection criteria. Each of the two selection criteria can identify a subset containing more than one card, but the intersection of the subsets identifies at most one card. The invention, by analogy, selects an acoustic card from more than 4 suits, and more than 13 card values.
[0031] FIG. 1 illustrates an access control system 100, which is preferably referenced to a portion of a voicemail system (not shown). A database 102 stores records 104, each record corresponding to a subscriber of the voicemail system. As shown in FIG. 2, each record 104 contains a voice template 200 of the corresponding subscriber's voice, the subscriber's voicemail box number 202 and other information 204 necessary for the operation of the voicemail system. Returning to FIG. 1, in the database 102, the records 104 are indexed and grouped into a cohort (or subset) by speakers' passcodes rather than, or optionally in addition to, being indexed by their unique voice mailbox numbers. The database 102 allows multiple records to be indexed by identical keys, because multiple subscribers are permitted to have identical passcodes.
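A minimal sketch of such a database follows, assuming an in-memory store in which a passcode key maps to a list of records. The class and field names are hypothetical; only the three record fields (items 200, 202 and 204) and the non-unique passcode key come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SubscriberRecord:
    voice_template: list                              # item 200
    mailbox_number: str                               # item 202
    other_info: dict = field(default_factory=dict)    # item 204


class CohortDatabase:
    """Records keyed by passcode; several records may share one key."""

    def __init__(self):
        self._by_passcode = defaultdict(list)
        self._by_mailbox = {}   # optional second index on the unique mailbox number

    def add(self, passcode_id, record):
        self._by_passcode[passcode_id].append(record)
        self._by_mailbox[record.mailbox_number] = record

    def cohort(self, passcode_id):
        """Return every record stored under this passcode (possibly none)."""
        return list(self._by_passcode.get(passcode_id, []))


db = CohortDatabase()
db.add("1234", SubscriberRecord([0.2, 0.7], "box-17"))
db.add("1234", SubscriberRecord([0.9, 0.1], "box-42"))
db.add("open sesame", SubscriberRecord([0.5, 0.5], "box-99"))
```

Here two subscribers share the passcode "1234", so looking up that key returns a cohort of two records rather than a single record.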
[0032] To identify and verify a user 106 in a single interaction, the access control system 100 prompts the user to speak her passcode, and two speech processors 108 and 110 process the user's response 112. The first speech processor 108 uses well-known SI speech recognition technology, such as is commercially available from Voice Control Systems of Dallas, Tex., to convert the spoken passcode 112 into a word phrase, number or other keying symbol 114. This word phrase or number 114 is used to index into the database 102 and retrieve a set of one or more records, the “cohort” 116, which includes all the individual voice templates 200 that correspond to the subset of subscribers who have spoken the same word phrase or number, i.e. all the subscriber voice templates that correspond to the spoken passcode 112. The number of voice templates retrieved 116 is generally a small fraction of the total number of voice templates stored in the database 102, because passcodes are fairly unique, based on individual subscriber preferences. Preferably, the set of retrieved voice templates 116 is stored in a buffer (not shown) or a file prior to the comparison steps described below. Alternatively, the set of voice templates 116 can be retrieved and compared one at a time.
[0033] If no record 104 in the database 102 corresponds to the word phrase or number 114, the user 106 is prompted again, and the above-described process is repeated. Alternatively, a “second-best” word phrase or number proposed by the SI speech processor 108 can be used for the word phrase or number 114. This corresponds to selecting the alternative using different values of Type I and Type II error thresholds for the SI ASR process. If, after a predetermined number of retries, no record corresponds to the word phrase or number 114, the user 106 is prompted to enter his voicemail box number and passcode separately, and the user is identified and verified as in a conventional access control system.
[0034] The second speech processor 110 uses well-known SD voice recognition technology, such as is available from Voice Control Systems of Dallas, Tex., to calculate a separate second set of parameters, i.e. a voice template 118 of discriminating acoustic features, from the spoken passcode 112. A comparator 120 compares the calculated voice template 118 to each of the retrieved voice templates 116. If the calculated voice template 118 matches one of the retrieved voice templates 116 within the acceptance limits specified by the Type I and Type II error thresholds, the user 106 is considered identified and verified, and the speaker's voicemail box number 202, or other information 204, is retrieved from the database record that corresponds to the matching voice template. A match/no-match indicator 122 and, if appropriate, the voicemail box number and/or other information are sent to the rest of the voicemail system (not shown).
[0035] On the other hand, if the calculated voice template 118 does not match any of the retrieved voice templates 116, the user 106 is prompted to speak her passcode again and the above-described process is repeated. After a predetermined number of retries, if the calculated voice template 118 still does not match any of the retrieved voice templates 116, the user 106 is prompted to enter her voicemail box number, and the user is identified and verified as in a conventional access control system.
[0036] The user interface also allows the user 106 to change his passcode and voice template stored in the database record 104. After the user 106 has logged into the system, he can select an option, which allows him to re-record his passcode and voice template 200. Re-recording the voice template is necessary in case the user decides to transfer his account to another user 106 or is encountering a high number of Type I or Type II errors. The interface can also allow multiple users to share the same account and passcode, for example a husband and wife who have a joint credit card. The system can learn to recognize the same passcode spoken by both authorized users of the account. This is convenient for shared account holders in that they both can have the same passcode. Additionally, shared account holders can use different passcodes to access the same account.
[0037] In another embodiment, since an utterance is basically an audio signal, any type of utterance can be a passcode. For example, a speaker can spell his first name or sing a song refrain. Features such as pitch, pitch rate change, high frequency captured, glottal waveform, and temporal duration of sound events can be detected for use as a passcode. In fact, any acoustic information, even beyond human hearing ranges, can be used. At least two different signal processors process this audio signal, and each signal processor produces a digital output that represents a different characteristic of the audio signal. For example, a speech recognizer could convert a spoken word or phrase into a set of alphanumeric characters. Each authorized person's record in the database is indexed according to at least two of these distinctive features, and each authorized person's record contains data that represent these at least two characteristics. When a user makes an utterance to gain access to a protected resource, then at least two different signal processors process this utterance and the digital outputs are used to index into the database and retrieve subscriber information.
[0038] To implement this invention, a database record type (see FIG. 2) containing subscriber-specific SD information is created. The database record contains the subscriber's passcode information, a voice template 200 of a subscriber's voice, used for a comparison with the spoken passcode 112, a subscriber's voicemail box number 202 and any other information 204 necessary to the application. For the present invention, it is preferable to have the records sorted by passcode identifier 114, so that once the passcode has been identified by the SI system as an index, its cohort 116 is easily accessed.
[0039] In a serial embodiment of the present invention (see FIG. 3), the user speaks 300 a passcode to gain access to a protected resource. An SI speech-recognition processor converts 302 the spoken passcode into a passcode identifier, a word phrase or number, and then uses the passcode identifier to index 304 into a database of authorized persons and retrieve 306 the cohort 116. Multiple authorized persons can have identical passcodes. An SD voice-recognition processor then derives 308 a voice template from the user's speech, i.e. the spoken passcode, and compares 310 the derived voice template to the retrieved voice templates. The SD voice template is distinctive and has unique characteristics. If the derived voice template matches one of the retrieved voice templates within the confidence limits specified by Type I and Type II errors, the user is deemed 314 to be the authorized person who corresponds to the matching retrieved voice template. If there is no matching voice template in the cohort, then the system can re-receive the spoken passcode or prompt 312 the user to enter his account number manually. Alternatively, step 308 can be performed anywhere in between step 300 and step 310.
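The serial flow of FIG. 3 can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the SI recognizer and SD template extractor are passed in as hypothetical callables, templates are plain lists of numbers, the comparison is a Euclidean distance against a single threshold, and the database is a dictionary from passcode identifier to a list of (mailbox, template) pairs.

```python
import math


def distance(a, b):
    """Euclidean distance between two templates (an assumed comparison measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify_and_verify(utterance, recognize, derive_template, database, threshold):
    passcode_id = recognize(utterance)        # step 302: SI conversion
    cohort = database.get(passcode_id, [])    # steps 304, 306: index and retrieve cohort
    template = derive_template(utterance)     # step 308: SD template derivation
    for mailbox, stored in cohort:            # step 310: compare against the cohort
        if distance(template, stored) <= threshold:
            return mailbox                    # step 314: identified and verified
    return None                               # step 312: fall back to manual entry


# Hypothetical stand-ins: an "utterance" is a dict carrying the recognized
# words and the acoustic features a real processor would compute.
def fake_recognize(utterance):
    return utterance["words"]


def fake_derive_template(utterance):
    return utterance["features"]


database = {"1234": [("box-17", [1.0, 2.0]), ("box-42", [5.0, 5.0])]}
caller = {"words": "1234", "features": [5.1, 4.9]}
impostor = {"words": "1234", "features": [9.0, 9.0]}
```

In this sketch both subscribers share the passcode "1234"; the caller is resolved to "box-42" by the template comparison alone, and an impostor who knows the passcode is rejected.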
[0040] The serial embodiment described above can also be modified slightly to improve performance by utilizing a parallel embodiment, as shown in FIG. 4. This is similar to the serial embodiment, with reference numbers 400, 410, 412 and 414 corresponding to reference numbers 300, 310, 312 and 314, respectively. However, the computation 408 of the SD voice template from the spoken passcode is performed simultaneously with the SI conversion 402 from the spoken passcode to the passcode identifier, indexing 404 the cohort in the database and retrieving 406 the cohort from the database.
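The parallel arrangement of FIG. 4 can be sketched with two concurrent tasks. The sketch assumes hypothetical `recognize` and `derive_template` callables and a dictionary database; it uses Python threads only to show the structure, whereas a real system would presumably run the two computations on separate processors.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_front_end(utterance, recognize, derive_template, database):
    """Run cohort retrieval and template derivation concurrently.

    Returns (cohort, template), ready for the comparison step 410.
    """
    def retrieve_cohort():
        passcode_id = recognize(utterance)       # step 402: SI conversion
        return database.get(passcode_id, [])     # steps 404, 406: index and retrieve

    with ThreadPoolExecutor(max_workers=2) as pool:
        cohort_future = pool.submit(retrieve_cohort)
        template_future = pool.submit(derive_template, utterance)   # step 408
        return cohort_future.result(), template_future.result()


# Hypothetical stand-ins for the two speech processors.
def fake_recognize(utterance):
    return utterance["words"]


def fake_derive_template(utterance):
    return utterance["features"]


cohort, template = parallel_front_end(
    {"words": "7", "features": [1.0]},
    fake_recognize,
    fake_derive_template,
    {"7": [("box-1", [1.0])]},
)
```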
[0041] Using an embodiment of the present invention accessed via telephone, the caller's telephone number (if available to the system via PSTN technology such as caller ID or any other identification information) can be used in a number of ways to improve the system's performance during the comparison operation 310 and 410. Reducing the number of record comparisons is important because each comparison takes valuable resource time of a host unit running the system, and because fewer comparisons reduce Type II errors, since there will be fewer possible false acceptances.
[0042] FIG. 5 illustrates one possible embodiment, in which the caller's calling telephone number can be used to reduce the number of comparisons required. After the user enters his passcode, if a record corresponding to the caller's (mailbox, telephone or network-based) number exists 500 in field 204 of the database record 104, then that record is checked 502 first for a match, saving the time of having to check every record in the cohort. If Type I error is low, the user is accepted. If the caller's number does not match, then the search is broadened to include members of the cohort that have a matching area code or similar geographic area which then can be retrieved 504 and checked 506 for a match. Finally, the rest of the cohort will be retrieved 508 and checked 510 for a match. The system administrator has the ability to enable or disable any of these options according to preferences, defining classes of security services. The final result will return a no-match 512 or the record number of the match 514.
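The staged search of FIG. 5 can be sketched as an ordering of the cohort before comparison. The sketch assumes that each record carries a `phone` string, that the first three digits are the area code, and that the voice comparison is supplied as a hypothetical `matches` callable; none of these details are specified above.

```python
def staged_candidates(cohort, caller_number):
    """Order cohort records as FIG. 5 checks them: the record with the
    caller's own number, then records in the same area code, then the rest."""
    area = caller_number[:3] if caller_number else None
    exact = [r for r in cohort if caller_number and r["phone"] == caller_number]
    same_area = [r for r in cohort
                 if r not in exact and area and r["phone"][:3] == area]
    rest = [r for r in cohort if r not in exact and r not in same_area]
    return exact + same_area + rest


def find_match(cohort, caller_number, matches):
    """Return the mailbox of the first matching record (514) or None (512)."""
    for record in staged_candidates(cohort, caller_number):
        if matches(record):
            return record["mailbox"]
    return None


cohort = [
    {"mailbox": "a", "phone": "2125550001"},
    {"mailbox": "b", "phone": "4155550002"},
    {"mailbox": "c", "phone": "4155550003"},
]
order = [r["mailbox"] for r in staged_candidates(cohort, "4155550003")]
```

When no caller number is available, the ordering in this sketch degrades to the original cohort order, which corresponds to checking the whole cohort.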
[0043] In addition to using the caller's phone number to improve system performance, the caller's phone number can also be used to reduce Type I and Type II errors. When the user's spoken passcode is received and converted to its passcode identifier, the record in the cohort with the same phone number as the caller's is likely to be the account the caller is accessing. Therefore, when that particular speech template is being compared 502 to the spoken passcode, the Type I and Type II error thresholds can be adjusted to allow for a greater margin of error, i.e. allow fewer Type I errors and more Type II errors. This will permit accounting for background noises that may be present in the caller's home, or if the caller has a cold and sounds different.
[0044] The caller's phone number can also be used to improve security for applications, such as credit card transactions, where security is critical. If a caller makes numerous unsuccessful attempts to access an account, the system will end up returning 512 a no-match a number of times. At this point the system can store the caller's phone number (if available) and digital copies of his spoken passcode. This way the unauthorized user's own phone number and voice (unknown to him) will be stored in case they are needed later, say by the authorities.
[0045] In addition, the system administrator can enable different search methods during the record comparisons, according to the administrator's preferences. Either the system can search the entire cohort and return the record that contains the best match, among several matches, with the spoken passcode, or the system can stop searching the cohort once the first match is achieved. The former method would be more accurate and reduce Type II errors, because even if a match is found, it may belong to another user. Of course, the latter method is faster because it does not have to search the entire cohort, but this speed comes at the expense of more Type II errors.
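The two search methods can be contrasted in a short sketch. It assumes the cohort is a list of (mailbox, template) pairs and that a hypothetical `score` callable returns the distance between the spoken passcode and a stored template, with smaller values meaning a closer match.

```python
def first_match(cohort, score, threshold):
    """Stop at the first record within the threshold (faster)."""
    for mailbox, stored in cohort:
        if score(stored) <= threshold:
            return mailbox
    return None


def best_match(cohort, score, threshold):
    """Score the whole cohort and return the closest record within the threshold."""
    scored = [(score(stored), mailbox) for mailbox, stored in cohort]
    within = [pair for pair in scored if pair[0] <= threshold]
    return min(within)[1] if within else None


# One-dimensional stand-in templates; the caller's sample is 1.0.
cohort = [("box-1", 1.4), ("box-2", 1.1)]


def score(stored):
    return abs(stored - 1.0)
```

With a threshold of 0.5 both records are acceptable in this sketch, so the first-match method returns "box-1" while the exhaustive method returns the closer "box-2", which illustrates how stopping early can accept another user's record.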
[0046] FIG. 6 is a flowchart of the process to initialize a record. This includes the user having to speak 600 a passcode, possibly several times, for the SD voice recognition technology to be trained to recognize the user's passcode. The spoken passcode is converted 602 to its voice template. The interface will then convert 604 the spoken passcode to a corresponding passcode identifier (or digital passcode), which is the spoken passcode reduced to a simple computer readable form used as an index. Once the passcode identifier is obtained, the cohort can easily be indexed 606. Then the new record can be added 608 with all the information the system requires.
[0047] The training may also include adjusting 610 the error threshold settings for that user and the other members of the cohort so that a maximum range for error will be allowed while still preserving discrimination between all members of the cohort. For example, consider a cohort with only 2 members, a male and a female. Since their voices will sound very different, a high margin of error can be tolerated. A higher margin of error is preferable because it allows a user to be properly identified even with background noises present or if he speaks in a different tone of voice. Now suppose a third member joins the cohort, with a voice template very similar to another member. A high margin of error is no longer permissible, as the two similar members may get misidentified. Thus, the margin of error for the entire cohort or for the similar members should be reduced to prevent misidentification. Thus, the adjusting of error levels ensures discrimination between all members of a cohort.
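One way to picture this adjustment is to tie each member's acceptance threshold to the distance to the nearest other template in the cohort. The rule used below, half the nearest-neighbour distance capped at a maximum, is an assumption chosen for illustration and is not stated in this disclosure; templates are plain lists compared by Euclidean distance.

```python
import math


def distance(a, b):
    """Euclidean distance between two templates (an assumed comparison measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cohort_thresholds(templates, max_threshold):
    """Give each member the widest threshold that cannot overlap its nearest
    neighbour's: half the nearest-neighbour distance, capped at max_threshold."""
    thresholds = {}
    for name, template in templates.items():
        others = [distance(template, other)
                  for other_name, other in templates.items() if other_name != name]
        nearest = min(others) if others else float("inf")
        thresholds[name] = min(max_threshold, nearest / 2.0)
    return thresholds


# Two dissimilar voices tolerate the full margin of error.
two_members = cohort_thresholds({"a": [0.0, 0.0], "b": [10.0, 0.0]}, 2.0)

# A third member close to "a" forces a tighter margin for both similar members.
three_members = cohort_thresholds(
    {"a": [0.0, 0.0], "b": [10.0, 0.0], "c": [1.0, 0.0]}, 2.0)
```

In this sketch only the two similar members are tightened, which corresponds to the option of reducing the margin of error for the similar members rather than for the entire cohort.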
[0048] Additionally, each time the system is accessed by a user speaking her passcode 300 or 400, the training mechanism can be activated in the background to use the spoken passcode to update the speech template stored in that user's record, reducing the number of Type I and Type II errors over time.
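A background update of this kind could, for example, blend each newly verified sample into the stored template. The exponential moving average below is an assumed update rule, chosen only to illustrate the idea; the disclosure does not specify how the template is updated, and the `rate` parameter is hypothetical.

```python
def update_template(stored, observed, rate=0.1):
    """Move each stored parameter a fraction `rate` toward the new sample."""
    return [(1.0 - rate) * s + rate * o for s, o in zip(stored, observed)]


# A stored template drifts toward the voice heard on a verified access.
updated = update_template([1.0, 2.0], [2.0, 2.0], rate=0.5)
```

A small rate in this sketch lets the template follow gradual changes in the subscriber's voice while limiting the effect of any single noisy sample.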
[0049] In an additional embodiment, cohorts are also maintained for callers. This allows numerous additional features, such as allowing the subscriber to retrieve only messages from a selected caller. This embodiment is illustrated in FIG. 7. First, when a caller starts to leave his message, the caller's name is captured 700. This can be done explicitly, by separately prompting the caller for his name and then his message. The caller's name can also be captured implicitly, by analyzing each spoken word during the first few seconds of the message using SI technology to identify a name, since when a caller leaves a message he customarily will say something like, “Hi this is Bob . . . ”
[0050] Then the system uses SI technology to identify the spoken name 702 (if not already done by the implicit process). Then, similar to the process used to index a subscriber's cohort, the caller's cohort is indexed 704. If more than one of the same name is present, for example if there are two people named “Bob” leaving messages, then SD technology is used to identify the appropriate caller.
[0051] As an additional option, the subscriber can designate special handling for certain callers. For example, whenever “Bob” calls, the call can be routed to his cellular phone. When a caller leaves a message, the system will check if the caller is on a subscriber's list for special handling 706, where it can then take the special action 708.
[0052] Finally, the system then stores the message in the appropriate caller's cohort 710.
[0053] FIG. 8 illustrates the process by which the subscriber retrieves messages using the caller's cohort option. First, the subscriber says a name of a caller he wishes to hear messages from 800. Then, the system uses SI technology to index the caller's cohort 802. If more than one of the names exist (for example, two different “Bobs” have left messages) 804, then the system will play back each of the names spoken by the caller himself so that the subscriber can select which caller he desires 806. Then, all of the messages left by that particular caller are played 808. In addition, the subscriber has the option to designate special handling for this caller 810. For example, all future calls from this caller can be routed to his cell phone.
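The caller's cohort option of FIGS. 7 and 8 can be sketched as a message store keyed first by the recognized name and then by the individual caller. The class, its method names and the `caller_id` values are hypothetical; `caller_id` stands in for whatever identity the SD comparison assigns to distinguish callers who share a name.

```python
from collections import defaultdict


class CallerCohorts:
    """Messages grouped by spoken name, then by individual caller."""

    def __init__(self):
        self._by_name = defaultdict(dict)   # name -> {caller_id: [messages]}

    def leave_message(self, name, caller_id, message):
        # Steps 702-710: index the cohort by name, then file under the caller.
        self._by_name[name].setdefault(caller_id, []).append(message)

    def callers_named(self, name):
        # Steps 802-804: more than one entry means the subscriber must choose.
        return sorted(self._by_name.get(name, {}))

    def messages(self, name, caller_id):
        # Step 808: every message left by the selected caller.
        return list(self._by_name.get(name, {}).get(caller_id, []))


store = CallerCohorts()
store.leave_message("bob", "bob-1", "Hi this is Bob, call me back.")
store.leave_message("bob", "bob-2", "Hi this is Bob from accounting.")
store.leave_message("bob", "bob-1", "Bob again, never mind.")
```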
[0054] The caller's cohort option is especially beneficial for the subscriber who receives many calls daily and needs a convenient, automatic way to organize all of his incoming calls.
[0055] Another embodiment of the present invention comprises a voiceprint identification system, comprising: a database storing a record for each user including a speech template and a passcode identifier; and a processor receiving and converting a spoken passcode spoken by a user into a corresponding passcode identifier and comparing said spoken passcode to each speech template stored in a corresponding subset.
[0056] In a further embodiment, any additional information, biometric or not, may be added to the subscriber record and used to improve confidence. For example, retinal scans may be combined with the present voiceprint access method in order to reduce Type I and Type II errors.
[0057] The described access control system can be utilized with a variety of systems, such as voicemail, credit card verification, building access and automatic teller machine (ATM) systems. For example, the access control system can be used to identify and verify the identity of a telephone caller, such as when the caller attempts to use a credit card to make the call or to make a purchase. The terms and expressions employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed.

Claims

What is claimed is:

1. A voiceprint identification system, comprising:

a receiving device receiving a message from a caller and isolating a name of the caller;

a database storing a record for the caller including a speech template corresponding to the name of the caller and the message; and

a selection device receiving a selected name from a subscriber, and retrieving and playing back messages stored in the database left by a caller with the selected name.

2. The voiceprint identification system as recited in claim 1, wherein the receiving device isolates a name of the caller by separately prompting the caller for his name.

3. The voiceprint identification system as recited in claim 1, wherein the receiving device isolates a name of the caller by using a speech independent processor to analyze a plurality of words spoken for a predetermined period of time at a beginning of the message until locating a name.

4. A method for operating a voiceprint identification system, comprising:

allowing callers to leave a name and a message; and

allowing a subscriber to speak a selected name to retrieve all messages left by a caller with the selected name.