WO2015032876A1 - Method and system for authenticating a user/caller - Google Patents

Method and system for authenticating a user/caller

Info

Publication number
WO2015032876A1
Authority
WO
WIPO (PCT)
Prior art keywords
text, caller, dependent, phase, independent
Application number
PCT/EP2014/068877
Other languages
French (fr)
Inventor
Marcel KOCKMANN
Javier PÉREZ MAYOS
Original Assignee
Voicetrust Eservices Canada Inc.
Application filed by Voicetrust Eservices Canada Inc. filed Critical Voicetrust Eservices Canada Inc.
Publication of WO2015032876A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/10: Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L17/24: Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase

Definitions

  • a first aspect of the present invention is a method for authentication of a presumed caller as a known caller.
  • the method comprises authenticating the presumed caller using a previously-enrolled text-dependent model of a portion of speech of the known caller which occurs in a text-dependent phase.
  • the method further authenticates the presumed caller against himself by prompting the presumed caller to repeat a random phrase(s) and/or engage in conversational speech in a text-independent phase and by performing a lexical-unconstrained, speaker consistency check using results from the text-dependent phase and the text-independent phase.
  • the method may further comprise eliminating session variability of the caller's speech during the text-independent phase.
  • the lexical-unconstrained, speaker consistency check includes comparing audio portions of the presumed caller taken during the text-independent phase with selected audio portions of the presumed caller taken during the text-dependent phase, to ensure that the presumed caller during the text-dependent phase is the same presumed caller during the text-independent phase.
  • this feature eliminates the requirement of a text-independent enrollment.
  • authenticating the presumed caller during the text-dependent phase includes processing a passphrase portion contained in the presumed caller's speech and comparing the passphrase portion against the previously-enrolled text-dependent model; and performing the lexical-unconstrained, speaker consistency check comprises extracting selected speech features from the audio data recorded during the text-dependent phase and comparing them with the presumed caller's utterances during the text-independent phase.
  • performing the lexical-unconstrained, speaker consistency check includes determining that the caller's speech during the text-dependent phase and either the random speech or the conversational speech during the text-independent phase belong to a common speaker.
  • performing the speaker consistency check includes comparing audio portions of the presumed caller taken during the text-independent phase with audio portions of the presumed caller taken during the text-dependent phase, to ensure that the presumed caller during the text-dependent phase is the same presumed caller during the text-independent phase.
  • a second aspect of the present invention includes a system for authentication of a presumed caller as a known caller.
  • the system comprises memory that includes a voiceprint database for storing a previously-enrolled, text-dependent model(s) of a portion(s) of speech of the known caller; a text-dependent device that is adapted to authenticate the presumed caller using the previously-enrolled text-dependent model of a portion of speech of the known caller, which occurs during a text-dependent phase; and a text-independent, speaker consistency check device that is structured and arranged to further authenticate that the presumed caller authenticated by the text-dependent device and a presumed caller making utterances during a text-independent phase are the same caller.
  • the text-dependent device is adapted to compare the previously-enrolled text-dependent model of a portion of speech of the known caller to a passphrase uttered by the presumed caller. In other variations, the text-dependent device is adapted to selectively extract text-dependent verification data from the passphrase uttered by the presumed caller.
  • the text-independent speaker consistency check device includes memory for storing text-dependent verification data extracted during the text-dependent phase; and, moreover, the text-independent, speaker consistency check device is adapted to compare extracted text-dependent verification data to an utterance(s) of the presumed caller, which occurs during the text-independent phase.
  • the text-independent, speaker consistency check device does not require a previously-enrolled text-independent model from the known caller. Indeed, advantageously, this feature eliminates the requirement of a text-independent enrollment.
  • FIG. 1 shows a diagram of an exemplary embodiment of a system for providing caller authentication using voice biometrics.
  • FIG. 2 shows a flow diagram of an exemplary embodiment of a method for providing caller authentication using voice biometrics.
  • a method and a system for providing caller authentication using hybrid active/passive voice biometrics are disclosed.
  • the method and system combine a highly-accurate, active, text-dependent portion with an active/passive, text-independent portion, which are configured to separate the underlying two-step process into two distinct problems.
  • the first problem is a text-dependent, speaker verification problem in which lexical variability may be eliminated, while optimizing the system to handle environmental intersession variability for a fixed passphrase.
  • only characteristic attributes of the speaker may be used to distinguish the "class," which is to say, the speaker, while unwanted, i.e., nuisance, attributes (e.g., attributes associated with the handset type, handset age, channel, noise, and so forth) may be ignored or otherwise compensated for from the class.
  • the second problem is a text-independent speaker consistency problem in which session variability may be eliminated, while optimizing the system to handle lexical intra-speaker variability.
  • the present system and method use characteristic attributes of the speaker so that a discrete session may be used to distinguish the class, while nuisance attributes (e.g., attributes associated with the content and speaking style) of the recordings are ignored or otherwise compensated for from the class.
  • the embodied system may be adapted to minimize user interaction and, furthermore, may optimally leverage typical IVR and call center structures and processes. Referring to Figure 1, an illustrative embodiment of a system 10 for providing caller authentication is shown.
  • the system 10 may include a text-dependent recognition/ verification device 12, a voiceprint database 16, and a text-independent, lexical-unconstrained speaker consistency check device 18.
  • the process provides a preliminary, active, text-dependent (TD) recognition/verification of the user/caller's identity (STEP 1).
  • the step begins when a user/caller 15 enters, i.e., speaks, her claimed and presumed identity via a DTMF system or an ASR system, e.g., using or via an IVR-based flow, which is part of an initial system challenge (STEP 1A).
  • the text-dependent (TD) recognition/verification device 12 may then prompt the user/caller 15 to orally repeat a passphrase that is identical, i.e., text-dependent, to a previously-enrolled TD model, e.g., a voiceprint, of the enrolled speaker, who the user/caller 15 represents herself to be.
  • previously-enrolled TD models of all enrolled speakers may be stored in memory, e.g., in the voiceprint database 16.
  • TD verification features may be extracted (STEP 1B) and saved (STEP 1C) in a suitable database, e.g., in the memory of the speaker consistency check device 18.
  • because the TD recognition/verification process (STEP 1) is IVR-based, i.e., active, the user/caller's repetition of the passphrase or other portion of her speech is processed against, i.e., compared with, the previously-enrolled TD model(s) of the enrolled speaker (STEP 1C).
  • data are provided to the TD recognition/verification device 12 by the voiceprint database 16 for the comparison. If the user/caller's speech matches the previously-enrolled TD model(s) of the enrolled speaker, who the user/caller 15 represents herself to be, the user/caller 15 may be subjected to a second, text-independent (TI) phase of the process (STEP 2).
  • the system 10 may respond (STEP 4) to the lack of recognition/verification, e.g., by dropping the call (STEP 5) altogether or by repeating the text-dependent (TD) recognition/verification phase (STEP 1).
  • the voice biometrics (VB) of the active, TD recognition/verification phase (STEP 1) are optimized for a fixed, text-dependent passphrase, which yields high VB accuracy.
  • the TD recognition/verification phase (STEP 1) may be further optimized to account for environmental intersession variability, which may be due to the effects of aging, handset issues, channel issues, and the like. Indeed, factoring explicit intersession variability, without lexical variability, into the TD speaker recognition/verification phase (STEP 1) may be advantageous.
  • the user/caller 15 may be subjected to a text-independent (TI), lexical-unconstrained, speaker consistency check phase (STEP 2).
  • a purpose of the TI, lexical-unconstrained, speaker consistency check phase (STEP 2) is to ensure that utterances made during the TI, lexical-unconstrained, speaker consistency check phase (STEP 2) and utterances made during the TD phase (STEP 1A) originated from the same speaker.
  • the TI, lexical-unconstrained, speaker consistency check phase provides added protection against TD-phase spoofing attacks, in which a user/caller 15 may try to fool the system 10, e.g., by playing a recording of the enrolled caller speaking the passphrase, or by using HMM-based synthesis, voice conversion, and the like.
  • the TI, lexical-unconstrained, speaker consistency check phase may serve the purpose of detecting if incoming audio during the TD verification phase (STEP 1) originates from the same individual who is speaking during the TI, lexical-unconstrained, speaker consistency check phase (STEP 2).
  • the speaker consistency check device 18 performs this with very high accuracy.
  • the method may include a playback detection step (not shown) in which incoming audio data that are suspected of being a recording may be matched against a database of pre-recorded utterances, to ascertain whether or not the incoming audio data are pre-recorded.
  • the TI, lexical-unconstrained, speaker consistency check phase can be either active or passive depending on whether the system 10 is, respectively, a fully- automated, IVR- based system or a call center system.
  • the lexical-unconstrained, speaker consistency check phase includes a hybrid, active/passive system having a passive, call center system and/or an active, IVR-based system.
  • the user/caller 15 may be connected to an automated IVR system or to a human agent (STEP 2A), either of which will engage the user/caller 15 (STEP 2B) to prompt the user/caller 15 to speak totally random phrases having full lexical mismatch to the TD passphrase, which is to say, the TD verification features.
  • user/caller responses or discrete features evoked during the active and/or passive questioning or conversation, i.e., TI input, may be continuously extracted from the user/caller's utterances (STEP 2C) and provided to the TI, lexical-unconstrained, speaker consistency check device 18. More specifically, the TI, lexical-unconstrained, speaker consistency check device 18 may be adapted to check the uttered random phrases (STEP 2C) for consistency by comparing them (STEP 2D) to any of the saved TD verification features (STEP 1C).
  • the purpose of the comparison is to confirm with a high degree of certainty that utterances made during each phase of the process originated from the same speaker.
  • if the consistency check fails, the system 10 may respond (STEP 4) by dropping the call (STEP 5) altogether or by repeating the text-dependent (TD) recognition/verification phase (STEP 1).
  • the extracted feature data recorded during the TD phase (STEP 1B) may be integrated into a fraud list of audio data for use in early identification or detection of fraudulent, unauthorized user/callers.
  • if the system 10 finds a match(es) between the TD verification features (STEP 1B) and the TI data (STEP 2C), these data may be fed back to the IVR and/or to the agent (STEP 3), who will at some point determine the authenticity of the user/caller 15.
  • the TI, lexical-unconstrained, speaker consistency check phase may be further optimized to account for lexical variability, which may be due to the content and the speaking style of the recordings. Indeed, factoring explicit lexical variability, but without intersession variability, into the TI, lexical-unconstrained, speaker consistency check phase may be advantageous.
  • the system 10 includes a text-dependent verification device 12, a voiceprint database 16, and a consistency checking device 18.
  • the function of the devices and database 12, 16, and 18 and their interoperability have been discussed in detail above.
  • each of the devices 12 and 18 may include an organic processing device or a single processing device may operate each of the devices 12 and 18.
  • the invention, further, may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including memory storage devices.
  • Each processing device may include a general-purpose computing device in the form of a computer including a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit.
  • Each processing device may include a variety of computer readable media that can form part of the system memory and be read by the processing unit.
  • computer readable media may comprise computer storage media and communication media.
  • the system memory may include computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and random access memory (RAM).
  • RAM typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit.
  • the data or program modules may include an operating system, application programs, other program modules, and program data.
  • the operating system may be or include a variety of operating systems such as the Microsoft Windows® operating system, the Unix operating system, the Linux operating system, the Xenix operating system, the IBM AIX™ operating system, the Hewlett Packard UX™ operating system, the Novell Netware™ operating system, the Sun Microsystems Solaris™ operating system, the OS/2™ operating system, or another operating system or platform.
  • the system memory includes at least one set of instructions that is either permanently or temporarily stored.
  • the processing unit may be adapted to execute the instructions that are stored in order to process data.
  • the set of instructions may include various instructions that perform a particular task or tasks. Such a set of instructions for performing a particular task may be characterized as a program, software program, software, engine, module, component, mechanism, or tool.
  • the processing device may include a plurality of software processing modules stored in a memory as described above and executed on the processing unit in the manner described herein.
  • the program modules may be in the form of any suitable programming language, which is converted to machine language or object code to allow the processor or processors to read the instructions. That is, written lines of programming code or source code, in a particular programming language, may be converted to machine language using a compiler, assembler, or interpreter.
  • the machine language may be binary coded machine instructions specific to a particular computer.
  • any suitable programming language may be used in accordance with the various embodiments of the invention.
  • the programming language used may include assembly language, Ada, APL, Basic, C, C++, COBOL, dBase, Forth, FORTRAN, Java, Modula-2, Pascal, Prolog, REXX, and JavaScript.
  • instructions and/or data used in the practice of the invention may utilize any compression or encryption technique or algorithm, as may be desired.
  • An encryption module might be used to encrypt data.
  • files or other data may be decrypted using a suitable decryption module.
  • the computing environment may also include other removable/non-removable, volatile/ nonvolatile computer storage media.
  • a hard disk drive may read from or write to non-removable, nonvolatile magnetic media.
  • a magnetic disk drive may read from or write to a removable, nonvolatile magnetic disk
  • an optical disk drive may read from or write to a removable, nonvolatile optical disk such as a CD-ROM or other optical media.
  • Other removable/ non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like.
  • the storage media are typically connected to the system bus through a removable or non-removable memory interface.
  • the processing unit that executes commands and instructions may be a general purpose computer, but may utilize any of a wide variety of other technologies including a special purpose computer, a microcomputer, mini-computer, mainframe computer, programmed micro-processor, micro-controller, peripheral integrated circuit element, a CSIC (Customer Specific Integrated Circuit), ASIC (Application Specific Integrated Circuit), a logic circuit, a digital signal processor, a programmable logic device such as an FPGA (Field Programmable Gate Array), PLD (Programmable Logic Device), PLA (Programmable Logic Array), RFID integrated circuits, smart chip, or any other device or arrangement of devices that is capable of implementing the steps of the processes of the invention.
  • relational (or other structured) databases such as the voiceprint database 16 may provide such functionality, for example as a database management system which stores data related to the services and consumers utilizing the service.
  • illustrative databases include the MySQL Database Server, the ORACLE Database Server offered by ORACLE Corp. of Redwood Shores, CA, and the PostgreSQL Database Server.
  • processing units and/or system memories of the processing devices need not be physically in the same location.
  • Each of the processing units and each of the system memories used by the computer system may be in geographically distinct locations and may be connected so as to communicate with each other in any suitable manner. Additionally, it is appreciated that each of the processing units and/or system memory may be composed of different physical pieces of equipment.
  • a user may enter commands and information into the processing device through a user interface that includes input devices such as a keyboard and pointing device, commonly referred to as a mouse, trackball or touch pad.
  • other input devices may also be included.
  • One or more monitors or display devices may also be connected to the system bus via an interface.
  • processing devices may also include other peripheral output devices, which may be connected through an output peripheral interface.
  • the processing devices implementing the invention may operate in a networked environment using logical connections to one or more remote processing devices, i.e., computers, the remote computers typically including many or all of the elements described above.
  • Various networks may be implemented in accordance with embodiments of the invention, including a wired or wireless local area network (LAN), a wide area network (WAN), a wireless personal area network (PAN), and other types of networks.
  • When used in a LAN networking environment, computers may be connected to the LAN through a network interface or adapter.
  • When used in a WAN networking environment, processing devices typically include a modem or other communication mechanism. Modems may be internal or external, and may be connected to the system bus via the user-input interface, or other appropriate mechanism. Processing devices may be connected over the Internet, an Intranet, Extranet, Ethernet, or any other system that provides communications.
  • Some suitable communications protocols may include TCP/IP, UDP, or OSI for example.
  • communications protocols may include Bluetooth, Zigbee, IrDa or other suitable protocol.
  • components of the system may communicate through a combination of wired or wireless paths.
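The two-phase decision logic described in the bullets above (a text-dependent verification in STEP 1, followed by the lexical-unconstrained speaker consistency check in STEP 2) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes fixed-length speaker embeddings have already been extracted from the audio, and all function names and threshold values are hypothetical.

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def authenticate(td_emb: np.ndarray,
                 enrolled_td_model: np.ndarray,
                 ti_emb: np.ndarray,
                 td_threshold: float = 0.75,
                 ti_threshold: float = 0.60) -> str:
    """Two-phase hybrid decision.

    Phase 1 (STEP 1): text-dependent check of the passphrase utterance
    against the previously-enrolled voiceprint.
    Phase 2 (STEP 2): lexical-unconstrained consistency check that the
    random/conversational speech and the passphrase share a speaker.
    """
    if cosine_score(td_emb, enrolled_td_model) < td_threshold:
        return "reject"  # STEP 4/5: drop the call or repeat STEP 1
    # No text-independent enrollment is needed: the TI utterance is
    # compared with features saved from the TD phase (STEP 1C).
    if cosine_score(ti_emb, td_emb) < ti_threshold:
        return "reject"
    return "accept"
```

Note that the second comparison scores the text-independent utterance against the phase-1 utterance itself, not against an enrolled model, which is what allows the scheme to avoid a separate text-independent enrollment.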

Abstract

The invention relates to a method for authenticating that a caller is a known caller, the method comprising using a previously-enrolled text-dependent model of a portion of speech of the known caller to authenticate that the caller may be the known caller in a text-dependent phase. The invention also relates to a system for authentication of a caller as an enrolled caller, the system comprising: a memory that includes a voiceprint database for storing a previously-enrolled, text-dependent model of a portion of speech of the enrolled caller; and a text-dependent device that is adapted to preliminarily authenticate, in a first, text-dependent phase, the caller, using the previously-enrolled text-dependent model of a portion of speech of the caller.

Description

Method and system for authenticating a user/caller
Cross-Reference to Related Application
This application claims priority from U.S. Provisional Patent Application Number 61/873,590 filed September 4, 2013, which is hereby incorporated by reference in its entirety.
Preamble
The invention relates to a method for authenticating that a caller is a known caller, the method comprising using a previously-enrolled text-dependent model of a portion of speech of the known caller to authenticate that the caller may be the known caller in a text-dependent phase. The invention also relates to a system for authentication of a caller as an enrolled caller, the system comprising: a memory that includes a voiceprint database for storing a previously-enrolled, text-dependent model of a portion of speech of the enrolled caller; and a text-dependent device that is adapted to preliminarily authenticate, in a first, text-dependent phase, the caller, using the previously-enrolled text-dependent model of a portion of speech of the caller.
Background of the Invention
In the field of verifying the identity of a caller, e.g., using voice biometrics, there are typically two common use cases. The first is active caller authentication (ACA) and the second is passive caller authentication (PCA). ACA is fully-automated and uses Interactive Voice Response (IVR), while PCA occurs during a conversation between a client and a human call center agent. Each case, however, requires enrollment of the caller prior to her authentication.
A typical approach for ACA enrollment may occur after a user claim via a dual-tone multi-frequency (DTMF) signaling system or an Automatic Speech Recognition (ASR) system. For example, initial authentication of the user may occur via an approved Know Your Client (KYC) process, e.g., using a user-determined PIN. Subsequently, as part of enrollment, the system may be adapted to prompt the user to repeat a short, fixed, text-dependent passphrase such as "My voice is my passport" several times. Preferably, this enrollment process may take about five to ten seconds of speech.
From the user's utterance(s), a non-invertible model(s) may be generated. The model(s), which is created using selected features extracted from the user's utterances, comprises a "voiceprint," which, like a fingerprint, is presumed to be unique to the discrete user. Alternatively, the various repeated passphrases may be recorded and the recordings may then be saved in a memory for future comparison purposes. However, in many jurisdictions the making of and saving of voice recordings may raise privacy concerns and, therefore, may be regulated.
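As an illustration of the kind of non-invertible model described above, the sketch below pools per-frame feature vectors (for example, cepstral coefficients) from several passphrase repetitions into fixed-length summary statistics, from which the raw audio cannot be recovered. The feature representation and the pooling choice are assumptions for illustration only, not details taken from the application.

```python
import numpy as np

def make_voiceprint(repetitions: list) -> np.ndarray:
    """Pool per-frame feature vectors from several passphrase
    repetitions into a fixed-length, non-invertible voiceprint.

    `repetitions` is a list of (num_frames, num_features) arrays,
    one array per repetition of the passphrase.
    """
    frames = np.vstack(repetitions)  # all frames from all repetitions
    # Mean/std pooling keeps only summary statistics, so the original
    # audio cannot be reconstructed from the stored model.
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```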
After enrollment, and a user claim via a DTMF or an ASR system, the system may be adapted to prompt the user to repeat the fixed passphrase once for authentication purposes. Preferably, this requires only about two or three seconds of speech. The system may then accept or reject the user claim by comparing the two or three seconds of recorded passphrase with the previously-enrolled model(s) stored in memory.
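The accept/reject comparison in the preceding paragraph might be sketched as a similarity score checked against a decision threshold. In practice a calibrated verification back-end would be used; the scoring function and threshold here are purely illustrative.

```python
import numpy as np

def verify_passphrase(attempt: np.ndarray,
                      enrolled_model: np.ndarray,
                      threshold: float = 0.7) -> bool:
    """Accept or reject the user claim by scoring the single repeated
    passphrase against the previously-enrolled model."""
    score = float(np.dot(attempt, enrolled_model) /
                  (np.linalg.norm(attempt) * np.linalg.norm(enrolled_model)))
    return score >= threshold
```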
Advantageously, ACA is a relatively quick, fully-automated process that uses very short passphrases. It is highly scalable; offers very high accuracy; and provides automatic language selection. Disadvantageously, however, because the passphrase is static, ACA is prone to spoofing attacks using recordings of the user's voice, HMM-based synthesis, voice conversion, and the like.
In contrast, with PCA enrollment, after a user claim via a DTMF or an ASR system, a call center agent who is responsible for the caller engages the caller in a conversation. A portion of this conversation may be a pre-established and approved KYC process, which, typically, may consist of a series of security questions, the responses to which may be recorded. After successful verification of the caller, the agent may authorize an integrated Voice Biometric (VB) system to use the recorded portion of the conversation containing the caller's speech. Preferably, this requires about 30 to 60 seconds of speech. Alternatively, previously-recorded speech stored on a monitoring system may be used for enrollment. However, as previously mentioned, in some jurisdictions, recorded and stored speech raises privacy concerns and may be regulated.
After enrollment, authentication via PCA may require that, after a future user claim via a DTMF or an ASR system, the call center agent again engages the caller in a conversation. During the conversation, the caller's speech is continuously streamed to the VB system that continuously returns and visually displays VB results to the agent's screen. Depending on the agent's experience and confidence level, after reviewing the displayed VB results, the agent can make her decision as to authentication any time during the call. Preferably, this requires about 10 to 30 seconds of speech.
Advantageously, PCA is a natural and convenient process that requires human monitoring and human decision-making. Passive authentication of the caller occurs during a normal conversation with the caller and significantly decreases agent time needed to verify the caller's identity. Moreover, PCA is secure against spoofing, as the questions asked by a human agent may be random and, consequently, do not lend themselves easily to pre-recorded responses. Disadvantageously, with PCA, mainly due to lexical variability of the speech, a significantly larger volume of speech is necessary. VB system accuracy, however, also tends to be lower.
Accordingly, it would be desirable to provide a secure and highly-accurate system for pure active (IVR-based) and hybrid (active/passive) user voice authentication. Moreover, it would be desirable to provide a system that minimizes the disadvantages of standard ACA and PCA systems, namely spoofing attacks for active, text-dependent systems and low accuracy for passive, text-independent systems when data for comparative purposes are sparse.
It would also be desirable to provide a secure and highly-accurate system that avoids privacy concerns, which may be due to text-independent voiceprints that could be misused, and regulations that may result if the users are enrolled and enrollment requires storing previously-recorded audio utterances attributed to the user.

Summary of the Invention
A first aspect of the present invention is a method for authentication of a presumed caller as a known caller. In some embodiments, the method comprises authenticating the presumed caller using a previously-enrolled text-dependent model of a portion of speech of the known caller, which occurs in a text-dependent phase. After positively (preliminarily) authenticating the presumed caller during the text-dependent phase, the method further authenticates the presumed caller against himself by prompting the presumed caller to repeat a random phrase(s) and/or engage in conversational speech in a text-independent phase and by performing a lexical-unconstrained, speaker consistency check using results from the text-dependent phase and the text-independent phase. Optionally, the method may further comprise eliminating session variability of the caller's speech during the text-independent phase.
In some variations, the lexical-unconstrained, speaker consistency check includes comparing audio portions of the presumed caller taken during the text-independent phase with selected audio portions of the presumed caller taken during the text-dependent phase, to ensure that the presumed caller during the text-dependent phase is the same presumed caller during the text-independent phase. Advantageously, this feature eliminates the requirement of a text-independent enrollment.
In other variations, authenticating the presumed caller during the text-dependent phase includes processing a passphrase portion contained in the presumed caller's speech and comparing the passphrase portion against the previously-enrolled text-dependent model; and performing the lexical-unconstrained, speaker consistency check comprises: extracting selected speech features from the audio data recorded during the text-dependent phase;
comparing the extracted selected speech features with the presumed caller's utterances during the text-independent phase; and determining whether or not recorded audio data from the text-dependent phase and utterances recorded during the text-independent phase were uttered by a common speaker.
In some implementations, performing the lexical-unconstrained, speaker consistency check includes determining that the caller's speech during the text-dependent phase and either the random speech or the conversational speech during the text-independent phase belong to a common speaker. In another implementation, performing the speaker consistency check includes comparing audio portions of the presumed caller taken during the text-independent phase with audio portions of the presumed caller taken during the text-dependent phase, to ensure that the presumed caller during the text-dependent phase is the same presumed caller during the text-independent phase.
A second aspect of the present invention includes a system for authentication of a presumed caller as a known caller. In some embodiments, the system comprises memory that includes a voiceprint database for storing a previously-enrolled, text-dependent model(s) of a portion(s) of speech of the known caller; a text-dependent device that is adapted to authenticate the presumed caller using the previously-enrolled text-dependent model of a portion of speech of the known caller, which occurs during a text-dependent phase; and a text-independent, speaker consistency check device that is structured and arranged to further authenticate that the presumed caller authenticated by the text-dependent device during the text-dependent phase and a presumed caller making utterances during a text-independent phase are the same caller. In some variations, the text-dependent device is adapted to compare the previously-enrolled text-dependent model of a portion of speech of the known caller to a passphrase uttered by the presumed caller. In other variations, the text-dependent device is adapted to selectively extract text-dependent verification data from the passphrase uttered by the presumed caller.
In other embodiments, the text-independent speaker consistency check device includes memory for storing text-dependent verification data extracted during the text-dependent phase; and, moreover, the text-independent, speaker consistency check device is adapted to compare extracted text-dependent verification data to an utterance(s) of the presumed caller, which occurs during the text-independent phase. Preferably, the text-independent, speaker consistency check device does not require a previously-enrolled text-independent model from the known caller. Indeed, advantageously, this feature eliminates the requirement of a text-independent enrollment.
Brief Description of the Drawings
In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the present invention are described with reference to the following drawings, in which: FIG. 1 shows a diagram of an exemplary embodiment of a system for providing caller authentication using voice biometrics.
FIG. 2 shows a flow diagram of an exemplary embodiment of a method for providing caller authentication using voice biometrics.
Detailed Description of the Invention
A method and a system for providing caller authentication using hybrid active/passive voice biometrics are disclosed. The method and system combine a highly-accurate, active, text-dependent portion with an active/passive, text-independent portion, which are configured to separate the underlying two-step process into two distinct problems. The first problem is a text-dependent, speaker verification problem in which lexical variability may be eliminated, while optimizing the system to handle environmental intersession variability for a fixed passphrase. In short, only characteristic attributes of the speaker may be used to distinguish the "class," which is to say, the speaker, while unwanted, i.e., nuisance, attributes (e.g., attributes associated with the handset type, handset age, channel, noise, and so forth) may be ignored or otherwise compensated for. The second problem is a text-independent speaker consistency problem in which session variability may be eliminated, while optimizing the system to handle lexical intra-speaker variability. In short, the present system and method use characteristic attributes of the speaker within a discrete session to distinguish the class, while nuisance attributes of the recordings (e.g., attributes associated with the content and speaking style) are ignored or otherwise compensated for. Advantageously, the embodied system may be adapted to minimize user interaction and, furthermore, may optimally leverage typical IVR and call center structures and processes.

Referring to Figure 1, an illustrative embodiment of a system 10 for providing caller authentication is shown. The system 10 may include a text-dependent recognition/verification device 12, a voiceprint database 16, and a text-independent, lexical-unconstrained speaker consistency check device 18.
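The patent does not detail how nuisance attributes are "compensated for from the class." As a hedged illustration only, one crude stand-in is per-utterance mean normalization of the features, which cancels constant offsets such as a channel coloration; real systems use techniques such as cepstral mean/variance normalization or factor-analysis methods (e.g., JFA, i-vectors):

```python
def mean_normalize(feats):
    """Subtract the per-utterance feature mean so that a constant
    additive offset (e.g., a channel or handset effect) cancels out
    before speaker comparison. A toy stand-in for real compensation."""
    m = sum(feats) / len(feats)
    return [f - m for f in feats]
```

Note that two utterances differing only by a constant offset normalize to identical feature vectors, which is the point of the compensation.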
The functionality of the system 10 can be explained referring to Figure 2, which provides an exemplary method for providing improved user/caller authentication using text-dependent (TD) recognition and text-independent (TI), lexical-unconstrained, speaker consistency checking.
The Method
In some embodiments, in a first step, the process provides a preliminary, active, text-dependent (TD) recognition/verification of the user/caller's identity (STEP 1). The step begins when a user/caller 15 enters, i.e., speaks, her claimed and presumed identity via a DTMF system or an ASR system, e.g., using or via an IVR-based flow, which is part of an initial system challenge (STEP 1A). The text-dependent (TD) recognition/verification device 12 may then prompt the user/caller 15 to orally repeat a passphrase that is identical, i.e., text-dependent, to a previously-enrolled TD model, e.g., a voiceprint, of the enrolled speaker, who the user/caller 15 represents herself to be. Preferably, previously-enrolled TD models of all enrolled speakers may be stored in memory, e.g., in the voiceprint database 16. As or after the user/caller 15 repeats the passphrase, some discrete features of the user/caller's utterance, e.g., TD verification features, may be extracted (STEP 1B) and saved (STEP 1C) in a suitable database, e.g., in the memory of the speaker consistency check device 18.
Because the TD recognition/verification process (STEP 1) is IVR-based, i.e., active, the user/caller's repetition of the passphrase or other portion of her speech is processed against, i.e., compared with, the previously-enrolled TD model(s) of the enrolled speaker (STEP 1C). Preferably, data are provided to the TD recognition/verification device 12 by the voiceprint database 16 for the comparison. If the user/caller's speech matches the previously-enrolled TD model(s) of the enrolled speaker, who the user/caller 15 represents herself to be, the user/caller 15 may be subjected to a second, text-independent (TI) phase of the process (STEP 2). On the other hand, if there is no match, then the system 10 may respond (STEP 4) to the lack of recognition/verification, e.g., by dropping the call (STEP 5) altogether or by repeating the text-dependent (TD) recognition/verification phase (STEP 1).
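The match / no-match branching just described — proceed to the TI phase on success, otherwise drop the call or repeat the TD challenge — can be sketched as a small control loop. The function names, threshold, and retry policy below are assumptions for illustration, not taken from the patent:

```python
def td_phase(score_fn, reprompt_fn, audio, threshold=0.8, max_retries=2):
    """Drive the active TD verification (STEP 1).

    score_fn    -- scores an utterance against the enrolled TD model
    reprompt_fn -- re-challenges the caller for another passphrase attempt
    """
    for _ in range(max_retries + 1):
        if score_fn(audio) >= threshold:
            return "proceed_to_ti"    # hand off to the TI phase (STEP 2)
        audio = reprompt_fn()         # repeat the TD challenge (STEP 1)
    return "drop_call"                # respond to failure (STEPS 4/5)
```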
Preferably, the voice biometrics (VB) of the active, TD recognition/verification phase (STEP 1) are optimized for a fixed, text-dependent passphrase, which yields high VB accuracy. Assuming that there is no lexical variability (because the passphrase is always the same), the TD recognition/verification phase (STEP 1) may be further optimized to account for environmental intersession variability, which may be due to the effects of aging, handset issues, channel issues, and the like. Indeed, factoring explicit intersession variability, without lexical variability, into the TD speaker recognition/verification phase (STEP 1) may be advantageous.
After successful authentication during the active, TD recognition/verification phase (STEP 1), the user/caller 15 may be subjected to a text-independent (TI), lexical-unconstrained, speaker consistency check phase (STEP 2). A purpose of the TI, lexical-unconstrained, speaker consistency check phase (STEP 2) is to ensure that utterances made during the TI, lexical-unconstrained, speaker consistency check phase (STEP 2) and utterances made during the TD phase (STEP 1A) originated from the same speaker. Thus, the TI, lexical-unconstrained, speaker consistency check phase (STEP 2), inter alia, provides added protection against TD-phase spoofing attacks, in which a user/caller 15 may try to fool the system 10, e.g., by playing a recording of the enrolled caller speaking the passphrase, or by using HMM-based synthesis, voice conversion, and the like. In short, the TI, lexical-unconstrained, speaker consistency check phase (STEP 2) serves to detect whether incoming audio during the TD verification phase (STEP 1) originates from the same individual who is speaking during the TI, lexical-unconstrained, speaker consistency check phase (STEP 2). Advantageously, the speaker consistency check device 18 performs this with very high accuracy.
Optionally, or in addition to the TI, lexical-unconstrained, speaker consistency check phase (STEP 2) discussed below, the method may include a playback detection step (not shown) in which incoming audio data that are suspected of being a recording may be matched against a database of pre-recorded utterances, to ascertain whether or not the incoming audio data are pre-recorded.
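The optional playback-detection step could be sketched as a near-duplicate search: a genuine live repetition varies slightly from take to take, so incoming audio that is almost identical to an utterance already on file is suspicious. The similarity function and threshold below are assumptions; the patent does not specify the matching method:

```python
def is_probable_playback(incoming, prior_utterances, similarity_fn, threshold=0.99):
    """Flag incoming audio as a likely replay when it is near-identical
    to any previously seen utterance of the same passphrase."""
    return any(similarity_fn(incoming, past) >= threshold
               for past in prior_utterances)
```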
One of the many advantages and novel features of the present invention is that, in contrast with the TD recognition/verification phase (STEP 1), a TI model, e.g., a voiceprint, of the user/caller's speech has not been enrolled previously, nor is enrollment required. As a result, in lieu of pre-enrolled TI models, speech features selectively extracted during the TD recognition/verification phase (STEP 1B), e.g., TD verification features, may be processed and used for comparison purposes during the TI speaker consistency check phase (STEP 2). In short, the check is symmetric: no model is enrolled asymmetrically on the first test utterance for verification against the second, or vice versa.
The TI, lexical-unconstrained, speaker consistency check phase (STEP 2) can be either active or passive depending on whether the system 10 is, respectively, a fully-automated, IVR-based system or a call center system. Preferably, the lexical-unconstrained, speaker consistency check phase (STEP 2) includes a hybrid, active/passive system having a passive, call center system and/or an active, IVR-based system. According to either of the active or passive scenarios of the TI, lexical-unconstrained, speaker consistency check phase (STEP 2), once the identity of the user/caller 15 has been preliminarily verified in the TD phase (STEP 1), the user/caller 15 may be connected to an automated IVR system or to a human agent (STEP 2A), either of which will engage the user/caller 15 (STEP 2B), prompting her to speak totally random phrases having full lexical mismatch to the TD passphrase, which is to say, to the TD verification features.
User/caller responses or discrete features evoked during the active and/or passive questioning or conversation, i.e., TI input, may be continuously extracted from the user/caller's utterances (STEP 2C) and provided to the TI, lexical-unconstrained, speaker consistency check device 18. More specifically, the TI, lexical-unconstrained, speaker consistency check device 18 may be adapted to check the uttered random phrases (STEP 2C) for consistency by comparing them (STEP 2D) to any of the saved TD verification features (STEP 1C). Here again, the purpose of the comparison is to confirm with a high degree of certainty that utterances made during each phase of the process originated from the same speaker.
If, after a pre-determined period of time and/or after processing a pre-determined volume of data, the TI, lexical-unconstrained, speaker consistency check device 18 of the system 10 has not found sufficient evidence to deem a match, the system 10 may respond (STEP 4) by dropping the call (STEP 5) altogether or by repeating the text-dependent (TD) recognition/verification phase (STEP 1). Optionally, the extracted feature data recorded during the TD phase (STEP 1B) may be integrated into a fraud list of audio data for use in early identification or detection of fraudulent, unauthorized user/callers. If, on the other hand, the system 10 finds a match(es) between the TD verification features (STEP 1B) and the TI data (STEP 2C), these data may be fed back to the IVR and/or to the agent (STEP 3), who will at some point determine the authenticity of the user/caller 15.
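The streaming decision just described — accumulate TI evidence chunk by chunk until a match is confirmed or a processing budget runs out — can be sketched as follows. The evidence count, chunk budget, and matching predicate are illustrative assumptions, not values from the patent:

```python
def consistency_check(td_feats, ti_chunks, matches_fn, min_evidence=3, max_chunks=10):
    """Compare streamed TI chunks against the saved TD verification
    features (STEP 2D); authenticate once enough chunks match, or
    reject after the data budget is exhausted (STEP 4)."""
    matches = 0
    for n, chunk in enumerate(ti_chunks, start=1):
        if matches_fn(td_feats, chunk):
            matches += 1
        if matches >= min_evidence:
            return "authenticated"   # feed results back (STEP 3)
        if n >= max_chunks:
            break
    return "reject"                  # drop the call or repeat STEP 1
```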
Assuming that there is no session variability, the TI, lexical-unconstrained, speaker consistency check phase (STEP 2) may be further optimized to account for lexical variability, which may be due to the content and the speaking style of the recordings. Indeed, factoring explicit lexical variability, but without intersession variability, into the TI, lexical-unconstrained, speaker consistency check phase may be advantageous.
The System
As shown in Figure 1, the system 10 includes a text-dependent verification device 12, a voiceprint database 16, and a consistency checking device 18. The function of the devices 12 and 18 and the database 16, and their interoperability, have been discussed in detail above. In some embodiments of the present invention, each of the devices 12 and 18 may include its own processing device, or a single processing device may operate both of the devices 12 and 18. The invention, further, may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Each processing device may include a general-purpose computing device in the form of a computer including a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit.
Each processing device may include a variety of computer readable media that can form part of the system memory and be read by the processing unit. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. The system memory may include computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and random access memory (RAM). A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements, such as during start-up, is typically stored in ROM. RAM typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit. The data or program modules may include an operating system, application programs, other program modules, and program data. The operating system may be or include a variety of operating systems such as the Microsoft Windows® operating system, the Unix operating system, the Linux operating system, the Xenix operating system, the IBM AIX™ operating system, the Hewlett Packard UX™ operating system, the Novell Netware™ operating system, the Sun Microsystems Solaris™ operating system, the OS/2™ operating system, or another operating system or platform.
At a minimum, the system memory includes at least one set of instructions that is either permanently or temporarily stored. The processing unit may be adapted to execute the instructions that are stored in order to process data. The set of instructions may include various instructions that perform a particular task or tasks. Such a set of instructions for performing a particular task may be characterized as a program, software program, software, engine, module, component, mechanism, or tool.
The processing device may include a plurality of software processing modules stored in a memory as described above and executed on the processing unit in the manner described herein. The program modules may be in the form of any suitable programming language, which is converted to machine language or object code to allow the processor or processors to read the instructions. That is, written lines of programming code or source code, in a particular programming language, may be converted to machine language using a compiler, assembler, or interpreter. The machine language may be binary coded machine instructions specific to a particular computer.
Any suitable programming language may be used in accordance with the various embodiments of the invention. Illustratively, the programming language used may include assembly language, Ada, APL, Basic, C, C++, COBOL, dBase, Forth, FORTRAN, Java, Modula-2, Pascal, Prolog, REXX, and/or JavaScript, for example. Further, it is not necessary that a single type of instruction or programming language be utilized in conjunction with the operation of the system and method of the invention. Rather, any number of different programming languages may be utilized as is necessary or desirable.
Also, the instructions and/or data used in the practice of the invention may utilize any compression or encryption technique or algorithm, as may be desired. An encryption module might be used to encrypt data. Further, files or other data may be decrypted using a suitable decryption module.
The computing environment may also include other removable/non-removable, volatile/ nonvolatile computer storage media. For example, a hard disk drive may read from or write to non-removable, nonvolatile magnetic media. A magnetic disk drive may read from or write to a removable, nonvolatile magnetic disk, and an optical disk drive may read from or write to a removable, nonvolatile optical disk such as a CD-ROM or other optical media. Other removable/ non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The storage media are typically connected to the system bus through a removable or non-removable memory interface.
The processing unit that executes commands and instructions may be a general purpose computer, but may utilize any of a wide variety of other technologies including a special purpose computer, a microcomputer, mini-computer, mainframe computer, programmed micro-processor, micro-controller, peripheral integrated circuit element, a CSIC (Customer Specific Integrated Circuit), ASIC (Application Specific Integrated Circuit), a logic circuit, a digital signal processor, a programmable logic device such as an FPGA (Field Programmable Gate Array), PLD (Programmable Logic Device), PLA (Programmable Logic Array), RFID integrated circuits, smart chip, or any other device or arrangement of devices that is capable of implementing the steps of the processes of the invention.
In some cases, relational (or other structured) databases, such as the voiceprint database 16, may provide such functionality, for example as a database management system which stores data related to the services and consumers utilizing the service. Examples of databases include the MySQL Database Server or ORACLE Database Server offered by ORACLE Corp. of Redwood Shores, CA, the PostgreSQL Database Server by the PostgreSQL Global Development Group of Berkeley, CA, or the DB2 Database Server offered by IBM.
It should be appreciated that the processing units and/or system memories of the processing devices need not be physically in the same location. Each of the processing units and each of the system memories used by the computer system may be in geographically distinct locations and may be connected so as to communicate with each other in any suitable manner. Additionally, it is appreciated that each of the processing units and/or system memory may be composed of different physical pieces of equipment.
A user may enter commands and information into the processing device through a user interface that includes input devices such as a keyboard and pointing device, commonly referred to as a mouse, trackball or touch pad. Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, voice recognition device, keyboard, touch screen, toggle switch, pushbutton, or the like. These and other input devices are often connected to the processing unit through a user input interface that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
One or more monitors or display devices may also be connected to the system bus via an interface. In addition to display devices, processing devices may also include other peripheral output devices, which may be connected through an output peripheral interface. The processing devices implementing the invention may operate in a networked environment using logical connections to one or more remote processing devices, i.e., computers, the remote computers typically including many or all of the elements described above.
Various networks may be implemented in accordance with embodiments of the invention, including a wired or wireless local area network (LAN), a wide area network (WAN), a wireless personal area network (PAN), and other types of networks. When used in a LAN networking environment, computers may be connected to the LAN through a network interface or adapter. When used in a WAN networking environment, processing devices typically include a modem or other communication mechanism. Modems may be internal or external, and may be connected to the system bus via the user-input interface, or other appropriate mechanism. Processing devices may be connected over the Internet, an Intranet, Extranet, Ethernet, or any other system that provides communications. Some suitable communications protocols may include TCP/IP, UDP, or OSI for example. For wireless communications, communications protocols may include Bluetooth, Zigbee, IrDa or other suitable protocol. Furthermore, components of the system may communicate through a combination of wired or wireless paths.
Although internal components of the processing devices are not shown, those of ordinary skill in the art will appreciate that such components and the interconnections are well known. Accordingly, additional details concerning the internal construction of the computer need not be disclosed in connection with the present invention.

Claims

1. A method for authentication of a presumed caller as an enrolled caller, the method comprising:
preliminarily authenticating, in a text-dependent phase, the presumed caller using a previously-enrolled text-dependent model of a portion of speech of the known caller;
after preliminarily authenticating the presumed caller during the text-dependent phase, in a text-independent phase, further authenticating the presumed caller; and
performing a speaker consistency check with results from the text-dependent phase and the text-independent phase.
2. The method of claim 1, wherein further authenticating the presumed caller in the text-independent phase comprises prompting the presumed caller to engage in conversational speech.
3. The method of claim 1 or 2, wherein further authenticating the presumed caller in the text-independent phase comprises prompting the presumed caller to repeat a random phrase.
4. The method of one of the foregoing claims, further comprising providing playback detection in which incoming audio data that are suspected of being a recording are matched against a database of pre-recorded utterances, to ascertain whether the incoming audio data are pre-recorded.
5. The method of any of the foregoing claims, wherein authenticating the presumed caller during the text-dependent phase includes processing a passphrase portion contained in the presumed caller's speech and comparing the passphrase portion against some portion of the previously-enrolled text-dependent model.
6. The method of any of the foregoing claims, further comprising eliminating lexical variability of the caller's speech during the text-dependent phase.
7. The method of any of the foregoing claims, wherein performing the speaker consistency check comprises: extracting selected speech features from audio data recorded during the text-dependent phase;
comparing the extracted selected speech features with the presumed caller's utterances during the text-independent phase; and
determining whether recorded audio data during the text-dependent phase and utterances recorded during the text-independent phase were spoken by a common speaker.
8. The method of any of the foregoing claims, further comprising eliminating session variability of the caller's speech during the text-independent phase.
9. The method of any of the foregoing claims, wherein performing the speaker consistency check comprises determining that the caller's speech during the text-dependent phase and the random speech during the text-independent phase were spoken by a common speaker.
10. The method of any of the foregoing claims, wherein performing the speaker consistency check, during the text-independent phase, comprises using a random phrase that is lexically unconstrained.
11. The method of any of the foregoing claims, wherein performing the speaker consistency check with results from the text-dependent phase and the text-independent phase does not require a specific text-independent enrollment.
12. The method of any of the foregoing claims, wherein performing the speaker consistency check includes determining that the caller's speech during the text-dependent phase and the conversational speech during the text-independent phase were spoken by a common speaker.
13. A system for authentication of a caller as an enrolled caller, in particular for executing the method according to one of the foregoing claims, the system comprising:
memory that includes a voiceprint database for storing a previously-enrolled, text-dependent model of a portion of speech of the enrolled caller; a text-dependent device that is adapted to preliminarily authenticate, in a first, text-dependent phase, the caller, using the previously-enrolled text-dependent model of a portion of speech of the caller; and
a text-independent, speaker consistency check device that is structured and arranged to further authenticate, during a text-independent phase, that the caller authenticated by the text-dependent device and utterances made by the caller during the text-independent phase are the same individual.
14. The system of claim 13, wherein the text-dependent device is adapted to compare the previously-enrolled text-dependent model of a portion of speech of the enrolled caller to a passphrase uttered by the caller.
15. The system of claim 14, wherein the text-dependent device is adapted to selectively extract text-dependent verification data from the passphrase uttered by the caller.
16. The system of claim 15, wherein the text-independent, speaker consistency check device includes memory for storing the extracted text-dependent verification data.
17. The system of one of claims 13-16, wherein the text-independent, speaker consistency check device does not require a previously-enrolled text-independent model from the enrolled caller.
18. The system of one of claims 13-17, wherein the text-independent, speaker consistency check device is adapted to compare text-dependent verification data extracted by the text-dependent device to an utterance of the caller during the text-independent phase.
PCT/EP2014/068877 2013-09-04 2014-09-04 Method and system for authenticating a user/caller WO2015032876A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361873590P 2013-09-04 2013-09-04
US61/873,590 2013-09-04

Publications (1)

Publication Number Publication Date
WO2015032876A1 true WO2015032876A1 (en) 2015-03-12

Family

ID=51492332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/068877 WO2015032876A1 (en) 2013-09-04 2014-09-04 Method and system for authenticating a user/caller

Country Status (1)

Country Link
WO (1) WO2015032876A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998034216A2 (en) * 1997-01-31 1998-08-06 T-Netix, Inc. System and method for detecting a recorded voice
US20100131273A1 (en) * 2008-11-26 2010-05-27 Almog Aley-Raz Device,system, and method of liveness detection utilizing voice biometrics

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2541466A (en) * 2015-08-21 2017-02-22 Validsoft Uk Ltd Replay attack detection
GB2541466B (en) * 2015-08-21 2020-01-01 Validsoft Ltd Replay attack detection
US20210279318A1 (en) * 2017-08-09 2021-09-09 Nice Ltd. Authentication via a dynamic passphrase
US11625467B2 (en) * 2017-08-09 2023-04-11 Nice Ltd. Authentication via a dynamic passphrase
CN108124061A (en) * 2017-12-22 2018-06-05 维沃移动通信有限公司 The storage method and device of voice data

Similar Documents

Publication Publication Date Title
US7567901B2 (en) Bio-phonetic multi-phrase speaker identity verification
US8812319B2 (en) Dynamic pass phrase security system (DPSS)
US9607621B2 (en) Customer identification through voice biometrics
US10223512B2 (en) Voice-based liveness verification
US7212613B2 (en) System and method for telephonic voice authentication
US8095372B2 (en) Digital process and arrangement for authenticating a user of a database
US8805685B2 (en) System and method for detecting synthetic speaker verification
US8396711B2 (en) Voice authentication system and method
WO2007050156B1 (en) System and method of subscription identity authentication utilizing multiple factors
US20030074201A1 (en) Continuous authentication of the identity of a speaker
CN108417216B (en) Voice verification method and device, computer equipment and storage medium
US20140343943A1 (en) Systems, Computer Medium and Computer-Implemented Methods for Authenticating Users Using Voice Streams
US20130006626A1 (en) Voice-based telecommunication login
US20080270132A1 (en) Method and system to improve speaker verification accuracy by detecting repeat imposters
IL129451A (en) System and method for authentication of a speaker
RU2005133725A (en) USER AUTHENTICATION BY COMBINING IDENTIFICATION OF TALING AND REVERSE TURING TEST
US20210320801A1 (en) Systems and methods for multi-factor verification of users using biometrics and cryptographic sequences
US20070033041A1 (en) Method of identifying a person based upon voice analysis
US20210366489A1 (en) Voice authentication system and method
AU2012205747B2 (en) Natural enrolment process for speaker recognition
WO2015032876A1 (en) Method and system for authenticating a user/caller
Shirvanian et al. Quantifying the breakability of voice assistants
Kounoudes et al. Voice biometric authentication for enhancing Internet service security
Gupta et al. Text dependent voice based biometric authentication system using spectrum analysis and image acquisition
Mahanta et al. Warping path and gross spectrum information for speaker verification under degraded condition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14759194

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14759194

Country of ref document: EP

Kind code of ref document: A1