WO2020192890A1 - Systems and methods for speaker verification - Google Patents

Systems and methods for speaker verification

Info

Publication number
WO2020192890A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voiceprint
speech
active
word
Application number
PCT/EP2019/057475
Other languages
French (fr)
Inventor
Themos STAFYLAKIS
Petr MIZERA
Dimitris VASSOS
Original Assignee
Omilia Natural Language Solutions Ltd.
Application filed by Omilia Natural Language Solutions Ltd.
Priority to PCT/EP2019/057475
Publication of WO2020192890A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/22Interactive procedures; Man-machine interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification

Definitions

  • the present application relates to systems, devices, apparatuses and methods of using voice biometrics. More particularly, the application relates to voice biometrics applied to live natural speech and employing context aware models to verify users.
  • Computing systems, as well as devices, apparatuses and services, often have a controlled access mechanism for regulating user access to their resources and data. These mechanisms may be implemented in hardware or software or a combination of the two.
  • the most commonly used controlled access mechanism is the use of credentials, often in the form of a username and password pair.
  • more complex mechanisms may be employed where the user must be in possession of a physical token (e.g. a token generation device for e-banking code entry), key, or card (e.g. a bank card for use at Automated Teller Machines (ATMs)), etc.
  • Biometric, including Voice Biometric (VB), systems have been used in fingerprint reading, iris scanning, voice identification, face identification, etc. Their adoption and performance have had various degrees of success. Some issues related to their success are the varying level of correct user identification as a result of environmental conditions (e.g. poor lighting, noise, multiple concurrent speakers), robustness to users pretending to be someone else or using duplicates of the real user's characteristics to fool the identification/verification system (e.g. playback of the target user's voice, showing a photo of the target user's face or fingerprint), simplicity of use, and the necessity for special User Interface (UI) hardware.
  • voice-based user verification has been gaining acceptance as either a standalone verification mechanism or in combination with other mechanisms.
  • Existing systems often perform either speech recognition or voice recognition, but not both processes at the same time.
  • Speech recognition identifies particular words that are spoken, while voice recognition identifies the user or speaker that is speaking.
  • a newer such mechanism is based on voice biometrics, i.e. the processing of certain voice characteristics.
  • This mechanism can operate as "user identification", where the system identifies a user based on his voice, without previously knowing who the user is or without the user claiming an identity, and as "user verification", where the claimed or estimated identity of the user is known and the system verifies whether the claimed or estimated identity is correct or not.
  • Some of these systems maintain voiceprint databases and, when the speaker utters any phrase in any of the above scenarios, one or more "live" voiceprints are created and compared against the corresponding voiceprints stored in the database. Following correct verification (and in some cases identification) of the user from his voiceprint, an action is taken by the system, usually granting the said user access to a system or service.
  • Variations of the voiceprint matching mechanisms include adaptation of relevant thresholds, statistical processing, and use of multistage matching with an initial match providing a group of potential users from those whose data are stored in the database, and subsequent stage(s) refining the matching to get the most probable match.
  • An innovative solution is presented for speaker verification using voice biometrics on live natural speech or dialogues of arbitrary duration and content.
  • the proposed speaker biometric verification is based on the notion of Active Words (AW) where the speaker’s live speech is analyzed and compared against stored voiceprints.
  • the present innovative solution analyzes the uttered speech to identify words, classified as "active words", that is, words that are frequently used under a specific context (e.g. a financial transaction).
  • the present solution searches for such active words and compares them with stored active word voiceprints for the same speaker, previously derived during past interactions of the user with the system, and created from a similar model built from the previous interactions of sets of users with the system.
  • the use of Active Words solves the problem encountered in voice biometrics where very short speech segments are not enough to provide an accurate user identification and where the user’s interaction routine with the system has to be changed.
  • a speaker initiates a communication session with the VB system, which transcribes the speaker’s voice to text and identifies uttered words.
  • the system captures and uses session related data to pre-identify the user and then limit the amount of calculations for the similarity scores between the Active Words live voiceprints and only a subset of the archived voiceprints, which belong to the same user.
  • the system identifies the current context, where the content and/or context is derived from transcribed speech analysis and/or identification data relating to the specific system application and user session. Transcribed words are analyzed to select frequently used words (AW) and to produce a speaker live voiceprint for the AW, which is compared to one or more stored voiceprints for the same AW of the same speaker.
  • the stored voiceprint(s) has been constructed during enrolment and may have been updated during previous user sessions.
  • the stored voiceprint(s) may be stored in a compact way by exploiting correlations inside the voiceprint and with AW models representing average representations of uttered active words (and other words) of a set of system speakers-users. The system then creates similarity scores with the speaker’s archived voiceprints for the identified active words.
  • a text-independent speaker identification or verification model is also included, which extracts one voiceprint per utterance.
  • the similarity score provided by the text-independent model is combined with the similarity scores provided by the AW-based model using score-level fusion techniques.
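  • As a minimal illustration of the score-level fusion mentioned above, the sketch below combines a text-independent utterance score with the AW-based scores using a weighted sum; the weight value, the use of the mean AW score, and the function names are assumptions for illustration, not part of the disclosed method.

```python
# Hedged sketch of score-level fusion: combine the text-independent utterance
# score with the AW-based scores. The 0.5 weight and the mean AW score are
# illustrative assumptions.
from typing import List

def fuse_scores(text_independent_score: float,
                aw_scores: List[float],
                weight: float = 0.5) -> float:
    if not aw_scores:
        return text_independent_score
    mean_aw_score = sum(aw_scores) / len(aw_scores)
    return weight * text_independent_score + (1.0 - weight) * mean_aw_score

# Example: fuse_scores(0.62, [0.71, 0.68, 0.74]) -> a single fused score
```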
  • the present innovative solution and all its exemplary aspects allow the speaker to speak in natural language, without the need to change his pace, complexity, intonation, or any other characteristic of his speech. They also eliminate the need (present in prior art) for the speaker to enroll with the system by uttering enrolment words and phrases, or to repeat predefined words and phrases every time they are to be identified. Also, the present innovative solution eliminates the need to use long uttered phrases, thereby avoiding altering the user interaction pattern and adding unnecessary dialogue turns with the system to extend the user's utterances so that the system can accurately verify them. As a result, simpler, faster and more accurate voice biometric verification is achieved.
  • One aspect of the invention includes a system for performing speaker verification.
  • the system includes an interactive voice recognition (IVR) module arranged to perform a speech conversation with a first user and receive a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation.
  • the system includes a datastore arranged to store a group of active words associated with the interaction context, where each active word is selected based on one or more selection criteria derived from conversations of a population of users.
  • the datastore also stores the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words.
  • the system also includes an automated speech recognition (ASR) module arranged to perform speech recognition of the first user audio provided during the speech conversation, where the ASR module converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements.
  • the system further includes a voice biometric (VB) module arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or, if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
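  • The numbered VB-module steps above can be pictured with the short sketch below; the dictionary structures, the cosine similarity measure, and the 0.75 threshold are assumptions used only to make the flow concrete, not the claimed design.

```python
# Illustrative sketch of VB-module steps (i)-(viii); the data layout, cosine
# similarity, and threshold value are assumptions for illustration.
import numpy as np
from typing import Dict, List

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_user(transcript: List[str],                        # corresponding text elements
                captured_voiceprints: Dict[str, np.ndarray],  # per uttered word
                active_words: List[str],                      # group for this context
                stored_voiceprints: Dict[str, np.ndarray],    # first user's AW voiceprints
                threshold: float = 0.75) -> bool:
    scores = []
    for word in transcript:
        if word in active_words and word in captured_voiceprints and word in stored_voiceprints:
            # compare the captured AW voiceprint with the enrolled AW voiceprint
            scores.append(cosine(captured_voiceprints[word], stored_voiceprints[word]))
    if not scores:
        return False                 # no active words found in the utterance
    return max(scores) >= threshold  # verified if the best match clears the threshold
```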
  • the similarity score is based on the closest comparison of one of the captured voiceprints with one of the first user voiceprints.
  • An audio element may include at least one of a phoneme, syllable, word, subword, or phrase.
  • a text element may include at least one of a word, subword, and phrase.
  • Speech recognition may be implemented using at least one speech recognition model including vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and/or end to end automatic speech recognition.
  • the one or more of the comparisons of each captured voiceprint with each first user voiceprint may include at least one voice recognition model selected from the group of vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and end to end automatic speech recognition.
  • the one or more selection criteria for active words include at least one selected from the group of frequency of use, type of word, amount of phonemes in a word, combination of phonemes, and amount of syllables in a word or phrase.
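  • One way the frequency-of-use criterion above could be realized is sketched below: counting word occurrences over transcripts of a population of users for a given interaction context and keeping the most frequent candidates; the minimum-length filter and the top-N cut-off are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical selection of active words by frequency of use across a
# population's transcripts for one interaction context; the length filter and
# the top-N cut-off are assumptions for illustration.
from collections import Counter
from typing import Iterable, List

def select_active_words(transcripts: Iterable[str],
                        top_n: int = 10,
                        min_length: int = 4) -> List[str]:
    counts = Counter()
    for text in transcripts:
        for token in text.lower().split():
            if len(token) >= min_length:   # crude proxy for "enough phonemes"
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_n)]

# Example: select_active_words(["what is my account balance please", "card balance please"])
```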
  • the interaction context may include a type of interaction that the first user has with the IVR.
  • the interaction context may include a banking application, a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR-based application that enables users to interface with a service or product provider.
  • a server configured to perform speaker verification includes a communications interface arranged to perform a speech conversation with a first user and receive a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation.
  • the communications interface may be arranged to receive, from a datastore, a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users.
  • the communications interface may receive, from the datastore, the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words.
  • the server also includes a processor arranged to perform speech recognition of the first user audio provided during the speech conversation, where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements.
  • the processor is also arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or, if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
  • a further aspect includes a method for performing speaker verification including performing a speech conversation with a first user and receiving a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation.
  • the method includes receiving a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users and receiving the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words.
  • the method further includes: performing speech recognition of the first user audio provided during the speech conversation, where a processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements.
  • the above aspects should not be considered directed to an abstract idea. Instead, the above aspects should be considered directed to an Internet-centric problem or improvement of computer technology related to more efficient voice or speaker recognition that advantageously reduces memory and processing demands on a voice biometric system.
  • a voice biometric system is able to more efficiently focus on a subset of user audio or speech.
  • the system also advantageously combines speech recognition with voice recognition in a technically innovative way to enable rapid identification of the active words using speech recognition, to which voice recognition is then applied.
  • while the above aspects could involve an abstract idea, the aspects are not directed to that idea standing alone.
  • a long-standing problem with voice recognition is how to quickly, efficiently, and reliably verify the identity of speakers.
  • the above aspects are directed to technically improving the speed, efficiency, and reliability, while reducing the cost in processing and memory of speaker recognition.
  • FIG. 1 shows a schematic diagram of a system implementing aspects of the present inventions.
  • FIG. 2 shows basic speaker verification using natural speech as performed in prior art.
  • FIG. 3 shows a schematic diagram of the components and the flow of data in an innovative exemplary implementation of a voice biometric system using Active Words.
  • FIG. 4A shows a first example of a database storing speaker voiceprint and related information.
  • FIG. 4B shows an example of a database storing Active Words speaker voiceprint and related information.
  • FIG. 4C illustrates a database associating users with user identifiers.
  • FIG. 4D illustrates a database associating active words with various user population contexts.
  • FIG. 4E illustrates a database that associates voiceprints of active words by various users for a particular interaction context associated with an IVR system.
  • FIG. 5 shows the basic hardware architecture of an application server.
  • FIG. 6 shows the basic software components running on an application server.
  • the word "exemplary" is used herein to mean "serving as an example, instance, or illustration". Any aspect, implementation, or configuration described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects, implementations, or configurations. The acronym "IVR" is intended to mean "Interactive Voice Response".
  • the acronym“DM” is intended to mean“Dialogue Manager”.
  • the acronym "PSTN" is intended to mean "Public Switched Telephone Network".
  • the acronym "VAD" is intended to mean "Voice Activity Detector".
  • the term "mobile device" may be used interchangeably with "client device" and "device with wireless capabilities".
  • a mobile device may include, without limitation, a cellular telephone, a mobile handset, a personal digital assistant, a wireless tablet, a wireless computing device, and the like.
  • the term "user" may be used interchangeably with "ordinary user" and "speaker". It may also be used to mean "caller" in a telephone or VOIP call or conferencing session, "user of an application" or "user of a service", and "participant" in a text chat, audio chat, video chat, email, audio-conference or video-conference session.
  • the term "system" may be used interchangeably with "device", "apparatus", and "service", except where it is obvious to a reader of ordinary skill in the related art that these terms refer to different things, as is apparent from the context of the discussion in which they appear. Under any circumstance, and unless otherwise explicitly or implicitly stated in the description, these four terms should be considered to have the broadest meaning, i.e. that of encompassing all four.
  • the present invention treats the problem of identifying a speaker using voice biometrics applied to natural language, free speech dialogue of the speaker with a system, device, apparatus, or service. It aims to present a solution to many speaker verification problems, including avoiding having the speaker dictate predefined text extracts, repeat text prompts, alter his intonation, use simplified speech, emphasize predefined keywords, use non-verbal means of identification, or add unnecessary complexity to the identification process. It also targets the use of fully automated systems, apparatuses and devices which support natural language processing and understanding, and compatibility with existing Interactive Voice Response (IVR), Dialogue Manager (DM) and Automatic Speech Recognition (ASR) components.
  • FIG. 1 shows a schematic diagram of a system implementing aspects of the present inventions.
  • a speaker can use the identification service via any type of device, comprising an analogue telephone 10, a digital telephone or mobile (cellular) telephone or smartphone 20, or a computer or laptop or tablet 30 with voice capture and audio playback capabilities. Regardless of the device or apparatus of choice, the speaker's speech is converted into either a voice signal (if an analogue telephone 10 is used) or a data stream and sent to a network 40, which may be a Public Switched Telephone Network (PSTN) or one of a Public Land Mobile Network (PLMN), the Internet, a Private Network, or a Cloud, respectively.
  • the voice signal is converted to data (not shown in FIG. 1) by a server and then fed to the back-end portion of the system, through an optional firewall 50 for enhanced security of personal and corporate data.
  • an optional Cache and/or Proxy Server 60 handles the incoming data.
  • the inclusion of such a server or servers is optional and can serve to further enhance data security, in a mode analogous to secure browsing, by isolating an Application Server 70 and the data stored in a Database 80, so that a potential unauthorized intruder can access and modify only a copy of these data and not the original sensitive data. It can also serve as a faster means of accessing commonly used data that may be dispersed across a number of servers and databases, as opposed to a single server 70 and database 80.
  • a speaker is verified based on his natural speech.
  • FIG. 2 shows basic speaker verification using natural speech as performed in prior art.
  • the method starts with the speaker initiating a communication session 100 with the exemplary system of FIG. 1 (or a VB server).
  • the user may initiate a telephone session using an analogue telephone 10, a digital (Voice Over Internet Protocol (VOIP)) phone or mobile phone or smartphone 20 or tablet, laptop, or desktop computer 30.
  • the telephone session may also be implemented as a teleconferencing - multi-party session.
  • the speaker may use a software application running on his smartphone 20 or tablet, laptop, or desktop computer 30. This software application may provide VOIP telephone or teleconferencing - multi-party session, video call, video conferencing, text or voice or video chat session, e-mail session, or a combination of the preceding.
  • the devices 10-30, apparatuses, subsystems and applications perform no processing other than forwarding voice signals, or first digitizing voice signals, converting them into data and then forwarding them via any of the networks 40 shown in FIG. 1 to back-end servers 60, 70, and receiving voice signals or data representing voice signals for playback to the speaker.
  • the device chosen by the speaker captures his live voice 105.
  • the back-end server or servers 60, 70 receive the voice data corresponding to the user’s live natural speech and calculate his live voiceprint 130.
  • a voiceprint is a set of measurable characteristics of a human voice that can be used to uniquely identify the speaker. These characteristics, which are based on the physical configuration of the speaker’s mouth and throat, can be expressed as a mathematical formula or as a vector.
  • the voiceprint can be calculated with any algorithm and contain any number of features.
  • the voiceprint can be calculated at a backend server by creating an i-vector (or an x-vector or other representation), i.e. a fixed-length vector representation of the speaker's speech.
  • a first similarity score is calculated 150 for each pair of the speaker's live voiceprint and each fetched voiceprint and the pair with the largest similarity score is selected 155 along with the corresponding similarity score. This similarity score is then compared to a predefined threshold T1 160. If this score is greater than or equal to T1, then the speaker is correctly identified 170. If not, then the speaker identification has failed and the system may, optionally, prompt the speaker to repeat 180 what he just uttered or may start a dialogue to request clarifications or additional information that can be used to correctly verify the speaker against the partially-identified speaker from the call ID and other available data.
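  • A compact sketch of the scoring step just described: the live voiceprint is scored against each fetched stored voiceprint, the best-scoring pair is kept, and that score is compared with the threshold T1; cosine similarity and the example value of T1 are assumptions, since the text does not prescribe a particular similarity measure.

```python
# Sketch of the scoring step: score the live voiceprint against each fetched
# voiceprint, keep the best-scoring pair, and compare it with T1.
# Cosine similarity and the example T1 value are assumptions.
import numpy as np
from typing import Dict, Tuple

def best_match(live_vp: np.ndarray,
               stored_vps: Dict[str, np.ndarray],
               t1: float = 0.8) -> Tuple[str, float, bool]:
    best_id, best_score = "", -1.0
    for vp_id, vp in stored_vps.items():
        score = float(np.dot(live_vp, vp) /
                      (np.linalg.norm(live_vp) * np.linalg.norm(vp)))
        if score > best_score:
            best_id, best_score = vp_id, score
    return best_id, best_score, best_score >= t1
```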
  • Failure of voice recognition and/or speaker verification may be due to lack of a stored voiceprint corresponding to this particular speaker, or due to noise in the voice signal, changes in speaker’s voice as a result of e.g. a sore throat, aging if the speaker has not used the user identification service for a long time, or other reasons beyond the scope of this invention.
  • the system cannot identify the speaker unless the speaker provides additional information beyond that used for voice biometrics.
  • the repetition of the uttered speech by the speaker, or his participation in a natural speech dialogue with the system could lead to a correct identification result.
  • This method produces acceptable results assuming the speaker's voiceprint is among those stored at the backend servers 60, 70 and/or the database 80 and the live speaker's uttered speech has a sufficiently long duration (typically several seconds) for a useful live voiceprint to be constructed.
  • the database is populated with voiceprints corresponding to speakers of the system. This database population process may be repeated during the user's new interactions with the system or for new speakers-users.
  • Manual intervention of an operator or support staff may optionally be used.
  • Real time operation and response times depend on the processing power of the application server 70 undertaking the processing, which in turn depends on the amount of stored voiceprints.
  • when the number of potential speakers (and consequently the corresponding stored voiceprints) is large, in the order of hundreds of thousands or millions, the delays introduced for processing all these data and verifying the speaker can be beyond what users may consider acceptable. For this reason, the step of reducing the set of stored voiceprints that are used in the VB calculations is essential to reduce the stored voiceprint set against which to compare the live speaker's voiceprint.
  • the reduced stored voiceprint set may still be comprised of significantly different phonemic content as compared to that of the live voiceprint, making an effective comparison and subsequent scoring sub-optimal.
  • the live voiceprint data is comprised of at least a minimum duration of speech, which ensures that the phonemic content of the live voiceprint will be enriched enough so as not to be statistically significantly different from the stored voiceprint phonemic content.
  • FIG. 3 shows a schematic diagram of the components and the flow of data in an innovative exemplary implementation of a voice biometric system using AW.
  • the innovative Voice biometric system 1100 comprises an Interactive Voice Response (IVR) module, responsible for performing live speech conversation with a live speaker 1101, based on uttered questions and replies.
  • The speech communication between speaker 1101 and IVR 1105 is done according to what is described in FIG. 1.
  • IVR 1105 is connected to an Automatic Speech Recognition (ASR) module 1120, which analyzes uttered speech and transcribes the speech into text.
  • IVR 1105 streams voice received from speaker 1101 to ASR 1120 together with identification data.
  • Identification data may include, without limitation, a unique subscriber identifier such as an International mobile subscriber identity (IMSI), a mobile identification number (MIN), a mobile subscription identification number (MSID or MSIN), temporary mobile subscriber identity (TMSI), Mobile Subscriber ISDN Number (MSISDN), Mobile Station International ISDN Number, Mobile International ISDN Number, Subscriber Number (SN), and/or a Packet temporary mobile subscriber identity (P-TMSI), a unique electronic serial number (ESN), a mobile device identifier, a mobile equipment identifier (MEID), an International Mobile Equipment Identifier (IMEI), a media access control (MAC) address, Android ID, a Unique Device Identifier (UDID), Universally Unique Identifier (UUID), a Core Foundation Universally Unique Identifier (CFUUID), a globally unique identifier (GUID), an OpenUDID, a SecureUDID, an unique Identifier (UIDevice), LTE static IP address (UEIP), Tracking Area Identity (TAI), Temporary
  • ASR 1120 receives the streamed speech and identification data and uses an acoustic model and optionally a language model (both stored locally at ASR 1120, or at a local database, or at the cloud) to identify phonemes, syllables, words and sentences in the speaker’s 1101 speech.
  • the acoustic and language models used by ASR 1120 may be produced by the proposed innovative system. In some implementations, either one or both of the two models may be imported from other applications and systems, external to system 1100, where their creation involves analyzing speech files and their transcribed text content and timing information, and words, phrases, and their relationships in a given language, respectively.
  • Speech recognition may include automatic speech recognition (ASR), computer speech recognition, and/or speech to text (STT).
  • Speech recognition and/or voice recognition models herein may include, without limitation Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and/or end to end automatic speech recognition.
  • the transcribed speech may optionally be fed to a Natural Language Understanding (NLU) module.
  • the NLU uses a Semantic Model of the used language and creates metadata that describe what the words in the transcribed text mean. For instance, the phrase "how much money did I spend at Starbucks last month" is processed by the NLU, which assigns the tag "merchant" to "starbucks" and the tag "date" to "last month".
  • the semantic model is independent of and unrelated to the current invention and its aspects and can be imported from other systems or services.
  • ASR 1120 continues by streaming speech, transcribed speech (i.e. text) and control data (e.g. word and/or letter and/or sentence boundaries and tags identifying word and/or letter and/or sentence boundaries in the streamed speech) to Voice Biometric module 1130, so as to allow association of the streamed (i.e. uttered) speech with its content (i.e. transcribed text). In some implementations, the alignment of words may be done by the ASR module 1120 or the VB module 1130.
  • VB 1130 receives the speech, text and control data and uses them to identify speaker 1101. To reach an identification result, VB 1130 analyzes the transcribed text and control data to deduce the content and/or context of speech (e.g. a request to get the balance of a bank account). In some configurations, VB 1130 communicates with an NLU module (not shown in FIG. 3) to receive an estimation of the context of the speech, as this context is derived from the content of the transcribed speech by analyzing the natural language using a language model and rules to understand the conveyed meaning. Using the content and/or context of the speech, VB 1130 checks a lookup table of words associated with the detected content and/or context.
  • a selection criterion may be frequency of use.
  • Other selection criteria may be used such as, without limitation, type of word, amount of phonemes in a word, combination of phonemes, and/or amount of syllables in a word or phrase.
  • a selected word is referred to as an "Active Word" (AW).
  • An AW is used by VB 1130 to improve the performance of speaker 1101 verification.
  • VB 1130 uses AWs to limit the search of speaker voiceprints stored in a database 1140, accessed by VB 1130, to only those speaker voiceprints that correspond to AWs, and thus improve the phonemic content match between the live speaker voiceprint and the stored speaker voiceprints.
  • This process of matching the phonemic content of live and stored speaker voiceprints using active words alleviates the requirement for long speaker samples and long utterances, and allows VB 1130 to perform an accurate voiceprint scoring with minimal live speech samples.
  • This search-limiting action also improves database access times, as hundreds or thousands of voiceprints may be stored for the same speaker, and, more importantly, significantly reduces the processing needed to compare the live speaker's speech voiceprint (computed by VB 1130 in real time) with the voiceprints in database 1140. As a result, the number of comparisons is reduced from a scale of hundreds or thousands of comparisons to merely a few dozen or fewer.
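  • The reduction in comparisons described above can be pictured as a keyed lookup, as in the sketch below: stored voiceprints are indexed by (speaker, active word), so only the few entries matching the detected AWs of the pre-identified speaker are fetched for scoring; the indexing scheme is an assumption, not the actual layout of database 1140.

```python
# Hypothetical keyed lookup illustrating how AWs limit the voiceprint search:
# the database is modelled as a dict keyed by (speaker_id, active_word), so
# only entries for the detected AWs of the pre-identified speaker are fetched.
from typing import Dict, List, Tuple
import numpy as np

VoiceprintDB = Dict[Tuple[str, str], np.ndarray]   # (speaker_id, active_word) -> voiceprint

def fetch_candidate_voiceprints(db: VoiceprintDB,
                                speaker_id: str,
                                detected_aws: List[str]) -> Dict[str, np.ndarray]:
    """Return only the stored AW voiceprints relevant to this comparison."""
    return {aw: db[(speaker_id, aw)]
            for aw in detected_aws
            if (speaker_id, aw) in db}
```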
  • voiceprints may, in some implementations, be stored together with metadata describing their content and/or context.
  • metadata may be created using speech recognition and natural language understanding analysis, as well as speaker intent analysis. They may also be captured by analyzing speaker interaction with the system, or use of an application running at a user or speaker's computing device 20 or 30 which may provide a multimodal (voice, text, graphics, video, chat, email) user interface, while enabling all voice processing at the application server 70.
  • Each of the stored voiceprints may correspond to a word, phrase, or paragraph uttered by a user at an earlier session with the system. For instance, "my name is John" creates a different voiceprint from "thank you". This is not only due to the different words in these two phrases but also due to the different phonemes, and their combinations, leading to different intonations and linking between them, and so on. These characteristics may have an effect on the number of correct user identification results and failures, especially in the presence of noise. For instance, if the live user's voiceprint has a different context (e.g. is derived from different words or phonemes) than the stored voiceprints selected according to the pre-identification result, then the live and the stored voiceprints may not be similar enough to correctly identify the user and lead to a failure or false negative result.
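  • As a concrete picture of the storage layout described above, the sketch below models one stored voiceprint entry with its content and context metadata; the field names and the dataclass layout are illustrative assumptions, not the actual schema of database 700 or 1140.

```python
# Hypothetical record for one stored voiceprint entry; field names and layout
# are assumptions made for illustration only.
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class VoiceprintEntry:
    speaker_id: str                    # identifier of the enrolled speaker
    words: List[str]                   # uttered words the voiceprint was built from
    audio_files: List[str]             # references to stored audio (e.g. WAV files)
    embedding: np.ndarray              # the voiceprint itself (i-vector/x-vector)
    metadata: Dict[str, str] = field(default_factory=dict)   # content/context tags

entry = VoiceprintEntry(
    speaker_id="user-001",
    words=["debit", "account"],
    audio_files=["user-001/utt-42.wav"],
    embedding=np.zeros(256),
    metadata={"context": "banking", "content": "debit account"},
)
```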
  • the choice of the amount of AWs can be based on their frequency of occurrence and on AW models generated by VB 1130 and stored in an AW database 1160. These AW models may be created as an average model from the recorded speech of all or a subset of the speakers of system 1100, i.e., a user or speaker population.
  • Example active words may include "balance", "account" and "please" 1160. In this example, three active words are defined for the selected context.
  • VB 1130 could have selected more AWs or even a single AW.
  • the choice of the number of AWs may be based on performance metrics (e.g. verification accuracy or processing time).
  • a selection by the system administrator may be structured to reflect first the frequency distribution of the words in a collection of different speakers’ speech, or of the same speaker 1101 in past uses of the system for the same context.
  • a balance between the size of the set of AWs and VB 1130 performance may be calculated in real time by VB 1130, which may periodically adjust the size of the AW set for speaker 1101 and for the specific context in consideration.
  • a system administrator may define the size of the AW sets.
  • VB 1130 compares the live user’s speech voiceprint with the AW speaker’s voiceprint(s) for the detected content and/or context, stored in database 1140, and produces a first similarity score for each AW speaker’s voiceprint.
  • comparison is done in a vector space, where both the live speaker's speech voiceprint and the selected AW speaker's voiceprint(s) are i-vectors (or x-vectors, or other representations).
  • i-vectors may be stored in a compact form where correlations between the coefficients in the i-vector or between the i-vector and the corresponding AW i-vector (in database 1150) are exploited, so as to significantly reduce the dimensions of the i-vectors that need to be compared with the aim to reduce processing time during real-time operation.
  • VB 1130 uses the first similarity scores of the comparison results and selects the stored speaker voiceprint corresponding to the highest score result, which it then compares against threshold T1; if the first similarity score equals or exceeds T1, then the speaker is correctly verified.
  • AWs enables the word-by-word (or even subword-by- subword) comparison between the live user’s speech and the stored (e.g. in i-vector form) user’s sample for each AW, producing more accurate results at a fraction of the processing time required without AWs, and using live speech segments of very short duration.
  • AWs can provide accurate identification results using live speech of only a single word or of word segments (e.g. subwords).
  • System 1100 is managed by a Dialogue Manager (DM) module 1110 which controls the operation of all other modules 1105, 1120, 1130, 1150.
  • DM 1110 is controlled by an application developer or system administrator and exchanges control data with the other system components.
  • DM 1110 controls the operation of IVR 1105 to start and perform a dialogue with speaker 1101 in order to get his request to be serviced by system 1100 (e.g. to provide the speaker with the balance of his bank account), or to request further information (e.g. which account, credit or savings account), or to prompt the speaker to get more speech input in cases where an accurate speaker identification or verification is not possible using the already uttered speech by the speaker (e.g. very noisy sample, truncated, etc.).
  • Communication between IVR 1105 and DM 1110 may be done using any publicly available or proprietary protocol.
  • DM module 1110 also communicates with VB module 1130 to initiate and control a VB session using the streamed and transcribed speech and identification data received at VB module 1130 from ASR module 1120.
  • VB 1130 signals DM 1110 that speaker identification is not possible with high accuracy using the available live speech specimen.
  • DM 1110 may then instruct VB 1130 to store the initial speech, text and identification data and initiate a second round of identification, also involving IVR 1105, to get better scoring that will increase the accuracy of user identification. If identification is achieved, the result is signaled by VB 1130 to DM 1110 and DM 1110 signals all system modules to end the current session. DM 1110 may then output the identification result.
  • a speaker calls into system 1100, listens to a first speech prompt from IVR 1105, and responds by speaking a first utterance.
  • DM 1110 receives identification data (e.g. Automatic Number Identification (ANI), caller ID, MAC address) from IVR 1105.
  • the speaker cannot be fully identified by the identification information provided by IVR 1105 because a device, calling number, etc. can be associated with or used by more than one speaker (e.g. a phone number is used by a family, or a smart phone of a user may be used by a third person either under the owner’s consent or without it).
  • DM 1110 contacts and instructs VB 1130 to perform a lookup for an existing voiceprint referenced by the identification information, i.e. a voiceprint associated with the identification information.
  • VB 1130 initializes a new (empty) voiceprint, receives the utterance transcription from ASR 1120, identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to the AW, creates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system's AW models, and returns back to DM 1110 a control message (e.g. voiceprint_does_not_exist).
  • DM 1110 instructs IVR 1105 to play a next speech prompt to the speaker.
  • the speaker in response, speaks a second utterance, and his speech is streamed by IVR 1105 to ASR 1120.
  • ASR 1120 then relays the streamed speech to VB 1130.
  • the streamed speech is sent by IVR 1105 to DM 1110, and it is DM 1110 that relays the streamed speech to VB 1130.
  • ASR 1120 produces a transcription of the second utterance, which is sent to VB 1130.
  • VB 1130 evaluates the suitability of the speech sample for the creation of the voiceprint.
  • If the sample quality is not good (e.g. one or more quality metric is below a corresponding threshold), the sample is discarded and a "bad_sample" response is returned by VB 1130 to DM 1110.
  • VB 1130 identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to the AW, updates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system’s AW models, and also stores the AW speech samples (that have been segmented and isolated from the speaker’s speech using alignment of the speech with the transcribed text used to detect the AW) in database 1140.
  • the process is repeated until enough samples are stored in database 1140 so as to create a voiceprint for each AW, and a voiceprint is created for each AW.
  • the collection of enough samples to create the speaker voiceprint for all AWs may be done during a single speaker interaction with IVR 1105, or during several interactions.
  • speaker voice biometric verification cannot be performed until (at least one) speaker voiceprint for an AW has been saved in database 1140. So, until this voiceprint(s) can be stored, all speaker interactions with IVR 1105 need to involve other types of user (full not partial) identification than VB (e.g. key-in or utter a password).
  • if the voiceprint cannot be created, a "voiceprint_creation_failure_reason" is returned by VB 1130 to DM 1110.
  • the AW voiceprint is stored in database 1140 alongside the speech AW samples (i.e. audio) that were used to create the AW voiceprint and a "voiceprint_creation_success" message is sent by VB 1130 to DM 1110.
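  • The sample-collection loop described in this flow (accumulate AW audio across utterances, then create a voiceprint once enough samples exist) might look like the sketch below; the minimum sample count and the embedding-averaging step are assumptions, since the disclosure leaves the voiceprint model open.

```python
# Sketch of AW enrolment: accumulate speech-sample embeddings per active word
# across utterances and create a voiceprint once enough samples exist. The
# sample threshold and the averaging step are illustrative assumptions.
from collections import defaultdict
from typing import Dict, List
import numpy as np

MIN_SAMPLES = 3   # assumed minimum number of samples per AW

class AwEnrolment:
    def __init__(self):
        self.samples: Dict[str, List[np.ndarray]] = defaultdict(list)
        self.voiceprints: Dict[str, np.ndarray] = {}

    def add_sample(self, active_word: str, embedding: np.ndarray) -> str:
        self.samples[active_word].append(embedding)
        if len(self.samples[active_word]) >= MIN_SAMPLES:
            # create (or refresh) the AW voiceprint, e.g. by averaging embeddings
            self.voiceprints[active_word] = np.mean(self.samples[active_word], axis=0)
            return "voiceprint_creation_success"
        return "more_samples_needed"
```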
  • a speaker calls into system 1100, listens to a first speech prompt from IVR 1105, and responds by speaking a first utterance.
  • DM 1110 receives identification data (e.g. Automatic Number Identification (ANI), caller ID, MAC address) from IVR 1105.
  • the speaker cannot be fully identified by the identification information provided by IVR 1105 because a device, calling number, etc. can be associated with or used by more than one speaker (e.g. a phone number is used by a family, or a smart phone of a user may be used by a third person either under the owner’s consent or without it).
  • DM 1110 contacts and instructs VB 1130 to perform a lookup for an existing voiceprint referenced by the identification information, i.e. a voiceprint associated with the identification information.
  • VB 1130 returns back to DM 1110 a control message.
  • DM 1110 instructs IVR 1105 to play a next speech prompt to the speaker.
  • the speaker in response, speaks a second utterance, and his speech is streamed by IVR 1105 to ASR 1120.
  • ASR 1120 then relays the streamed speech to VB 1130.
  • the streamed speech is sent by IVR 1105 to DM 1110, and it is DM 1110 that relays the streamed speech to VB 1130.
  • ASR 1120 transcribes the utterance and forwards it to VB 1130 (directly or via DM 1110).
  • VB 1130 evaluates the suitability of the speech sample for the creation of the voiceprint. If the sample quality is not good (e.g. one or more quality metric is below a corresponding threshold), the sample is discarded and a "bad_sample" response is returned by VB 1130 to DM 1110.
  • VB 1130 identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to each AW, updates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system's AW models, and also stores the AW speech samples (that have been segmented and isolated from the user's speech using alignment of the speech with the transcribed text used to detect the AW) in database 1140.
  • the process is repeated until enough samples are stored in database 1140 so as to create a voiceprint for each AW, and a voiceprint is created for each AW.
  • the collection of enough samples to create the speaker voiceprint for all AWs may be done during a single speaker interaction with IVR 1105, or during several interactions.
  • voice biometric verification cannot be performed until (at least one) speaker voiceprint for an AW has been saved in database 1140. So, until this voiceprint(s) can be stored, all speaker interactions with IVR 1105 need to involve other types of speaker (full not partial) identification than VB (e.g. key-in or utter a password).
  • if the voiceprint cannot be created, a "voiceprint_creation_failure_reason" is returned by VB 1130 to DM 1110.
  • the AW voiceprint is stored in database 1140 alongside the speech AW samples (i.e. audio) that were used to create the AW voiceprint and a "voiceprint_creation_success" message is sent by VB 1130 to DM 1110.
  • System 1100 may subsequently perform voice recognition and/or speaker verification by comparing voiceprints of captured AWs during a user call with stored user-specific voiceprints associated with AWs.
  • the voiceprints may be created using a model that creates i-vectors (or x-vectors or other representation).
  • VB 1130 may compare the live user’s speech voiceprint with the stored AW speaker’s voiceprint(s) for the detected content stored in database 1140 and/or 460, and produce a first similarity score for each voiceprint associated with the user.
  • VB 1130 may perform comparisons in a vector space, where both the detected live speaker's or user's speech voiceprint and the selected AW speaker's voiceprint(s) are first converted into an i-vector (or x-vector, or other representation). Then, i-vectors (or other representations) may be stored in a compact form where correlations between the coefficients in the i-vector or between the i-vector and the corresponding AW i-vector (in database 1150) are exploited, so as to significantly reduce the dimensions of the i-vectors that need to be compared, with the aim to reduce processing time during real-time operation.
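  • One way the correlations mentioned above could be exploited is a learned low-dimensional projection applied before storage and scoring, as in the hedged sketch below; PCA via scikit-learn is used only as an example, since the disclosure does not prescribe a specific reduction method.

```python
# Hedged sketch of compact voiceprint storage: project i-vectors/x-vectors to a
# lower dimension before storing and comparing them. PCA is an example only.
import numpy as np
from sklearn.decomposition import PCA

def build_projection(training_vectors: np.ndarray, n_components: int = 50) -> PCA:
    """Fit a projection on a matrix of shape (num_voiceprints, dim)."""
    return PCA(n_components=n_components).fit(training_vectors)

def compress(pca: PCA, voiceprint: np.ndarray) -> np.ndarray:
    return pca.transform(voiceprint.reshape(1, -1))[0]   # store this compact form
```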
  • VB 1130 uses the first similarity scores of the comparison results and selects a stored speaker voiceprint corresponding to the highest score result, which it then compares against threshold T1; if the first similarity score equals or exceeds T1, then the speaker is correctly verified.
  • Any of the voice recognition models described herein may be employed by VB 1130 to determine similarity scores based on comparing a voiceprint from a detected live AW with a stored voiceprint associated with an AW of a user.
  • a single uttered speech is used to identify the speaker and no second or subsequent uttered speech segments are needed.
  • FIG. 4A shows a first example of a database storing speaker voiceprint and related information.
  • the database 700 stores entries corresponding to speakers of the speaker identification service.
  • a first speaker entry 710 corresponding to a first speaker, which entry may comprise random speech (e.g. speaker name, debit account, and card number). These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker ("Maria Schneider", "Debit Account", "Card Number", "1237890", "6543").
  • the first speaker entry is also associated with a stored Voiceprint 1 715 and metadata describing the content and context of the voiceprint or other information associated with the first speaker.
  • the content of the speech may be either random speech, requiring text-independent voice biometrics, or passphrase-specific (e.g., "my voice is my password"), requiring text-dependent voice biometrics.
  • the present disclosure describes systems and methods that advantageously enable voice recognition and/or speaker recognition by applying voice biometrics to words that a user is likely to use, e.g., active words, depending on the context of the user's interaction with a system or service. In this way, the processing power and memory of the system is substantially reduced with respect to text-independent voice biometric systems, while user acceptance is substantially increased with respect to text-dependent voice biometric systems because users are no longer required to recite the same phrase each time they access the system or service.
  • Database 700 also contains an n-th speaker entry 780, corresponding to an n-th speaker, which entry may comprise speaker name, account number, and address. These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the n-th user ("George Smith", "Account Number", "Address", "123454").
  • the n-th speaker entry is also associated with a stored Voiceprint n 785 and metadata describing the content and context of the voiceprint or other information associated with the n-th speaker.
  • Database 700 may also contain a second speaker entry 720 corresponding to the first speaker, which entry may comprise speaker name, credit card, and home city. These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker ("Maria Schneider", "Credit Card", "2378", "New York").
  • the second speaker entry is also associated with a stored voiceprint 2 725 and metadata describing the content and context of the voiceprint or other information associated with the first speaker. Additional entries (and associated voiceprints and metadata) may exist for any of the users, where each corresponds to different content (i.e. different uttered words and phrases) and context (i.e. different meaning of the word pair "New York" as part of an address and as part of the name "New York Commercial Bank").
  • a user and/or speaker may be asked to utter a specific passphrase (e.g. "my voice is my password..."). Other speakers are also asked to utter the same passphrase.
  • Database 700 contains entries for all speakers, where these entries correspond to the same passphrase.
  • Voiceprints are created and stored for each speaker uttering the same passphrase, together with metadata describing the content and context of the voiceprint or other information associated with the first speaker.

Voiceprint creation and storage with AW
  • FIG. 4B shows an example of a database storing speaker Active Words voiceprint and related information.
  • the database 700b stores entries corresponding to speakers of the speaker identification service.
  • a database storing data related to a particular interaction context, e.g., a banking application.
  • system 1100 uses historical data (e.g. word frequency used in previous speaker interactions with the system) and selects "Maria", "Schneider", "1237890", and "6543" as AW1, AW2, AW3, and AW4, respectively. For each AW, system 1100 creates an associated Voiceprint 715b, 725b, 735b, 745b, where each AW Voiceprint is stored along with metadata describing the content and context of the voiceprint or other information associated with the first speaker.
  • Database 700 also contains speaker entries for other speakers, which entries may include speaker name, account number, and address which are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the other speakers. These entries are also associated with a stored Voiceprint and metadata describing the content and context of the voiceprint or other information associated with the associated speaker.
  • the speaker may be asked to utter a specific passphrase (e.g. "my voice is my password...").
  • the same passphrase is uttered by all other speakers. All speaker utterings of the passphrase are stored in database 700 together with metadata and associated voiceprints.
  • FIG. 4C illustrates a database 400 associating users with user identifiers
  • the database includes a column 402 having a list of known users such as User A, User B, User C, and User X among other users. The number of users may be limited only by the processing and memory capacity of, for example, System 1100.
  • Column 404 includes user identifiers (e.g., IDA, IDB, IDC, and IDX, among others) corresponding to each user. A user ID may include one or more of the identifier information previously discussed herein. In some configurations, column 404 may include multiple IDs that correspond to a particular user.
  • FIG. 4D illustrates a database 440 or datastore associating active words with various interaction contexts. Column 442 includes a list of various interaction contexts.
  • An interaction context includes a type of interaction by a user with a system such as System 1100.
  • Interaction Context A may represent a banking application where users interface with System 1100 to enroll, access their banking account information, or initiate banking transactions via interactive voice responses with System 1100.
  • Interaction Context B may, for example, include an insurance application that enables users to submit claims via an IVR system.
  • Interaction Context C, other Interaction Contexts, and Interaction Context X may include a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR based application that enables users to interface with a service or product provider.
  • Column 444 lists groups of Active Words that are associated with a corresponding Interaction Context.
  • Interaction Context A is associated with three Active Words AWA1, AWA2, and AWA3. While three Active Words are shown, any number of Active Words may be associated with an Interaction Context as previously discussed herein.
  • System 1100 may select AWs that are associated with a particular Interaction Context based on one or more selection criterion, which may include a frequency of use for a particular word or phrase by a population of users that interact with System 1100 via a particular Interaction Context. For example, System 1100 may gather data regarding all users during Interaction Context A, i.e., during calls to a banking application.
  • System 1100 may identify those words, subwords, or phrases most frequently spoken by all users of the user population that use the banking application.
  • the System 1100 may designate and/or identify one or more AWs (e.g., AWA1, AWA2, and AWA3) for Interaction Context A (e.g., a banking application).
  • AWA1 may be the word "account."
  • AWA2 may be the word "card."
  • AWA3 may be the word "one."
  • AWC1 may be the word "car."
  • AWC2 may be the word "accident."
  • System 1100 may determine that only two AWs are needed to verify a user identity using Interaction Context C.
  • FIG. 4E illustrates a database 460 that associates voiceprints of active words by various users for a particular interaction context associated with an IVR system. A minimal illustrative sketch of these example datastores appears after this list.
  • Column 462 includes a list of users while columns 464 include voiceprints of recorded AWs of corresponding users for Interaction Context A.
  • other databases like database 460 will include voiceprints of AWs of corresponding users.
  • captured audio of AWs may be stored in addition to or instead of storing voiceprints.
  • VB 1130 may generate voiceprints of stored audio of AWs in real-time or near real-time, while also generating voiceprints of detected live audio of AWs to perform speaker recognition.
  • User A has corresponding stored voiceprints VPUSERAAWA1, VPUSERAAWA2, and VPUSERAAWA3.
  • voiceprints may have been pre-generated based on a prior enrollment session and/or other IVR session between User A and System 1100.
  • ASR 1120 may detect AWA1 in User A's speech using speech recognition of User A's conversation.
  • VB 1130 may then process the captured audio of AWA1 to generate a voiceprint of the captured audio of AWA1, and compare the voiceprint of the captured AWA1 with the stored voiceprint VPUSERAAWA1 to perform voice recognition and confirm the identity of User A.
  • column 462 may include identification information associated with each user (e.g., User A, User B, and so on).
  • FIG. 5 shows the basic hardware architecture of an application server.
  • Application Server 5100 comprises a microprocessor 5110, a memory 5120, a screen adapter 5130, a hard-disk 5140, a graphics processor 5150, a communications interface adapter 5160, and a UI adapter 5170.
  • Application Server 5100 may also contain other components which are not shown in FIG.5 or lack some of the components shown in FIG.5.
  • FIG. 6 shows functional elements running on an application server 6200.
  • the functional elements may be implemented as hardware, software, firmware, or a combination thereof.
  • the functional elements may include an Operating System (OS) 6210, Utilities 6220, an Application Server Software 6230, at least one Application or Web Service 6240, and at least one Hardware driver 6250. Additional hardware and/or software components may run at the application server while some of those shown in FIG. 6 may be optionally not implemented.
  • the method described in FIG. 2 is modified by omitting the user pre-identification step 120.
  • the Pre-identified User’s Live Voiceprint Calculation step 130 is modified to calculate the user’s live voiceprint (i.e. without knowledge of the user’s identity), and the Fetch Pre-identified User’s Stored Voiceprints step 140 is modified to fetch all stored voiceprints (i.e. belonging to all users), or all stored voiceprints of the same category or of similar characteristics.
  • Additional exemplary aspects include a method for performing speaker verification including performing a speech conversation with a first user and receiving a first user identifier.
  • the speech conversation may have an interaction context based on a subject matter of the speech conversation.
  • the method also includes receiving a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users.
  • the method includes receiving the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words.
  • the method further includes performing speech recognition of the first user audio provided during the speech conversation where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements.
  • the method also includes receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements.
  • the method also includes: comparing the plurality of corresponding text elements with each active word of the group of active words, identifying text elements matching each active word of the group of active words, generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word, comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and if the similarity score is greater than or equal to a threshold value, indicating that the first user identifier is verified or if the similarity score is less than the threshold value, indicating that the first user identifier is not verified.
  • a further exemplary aspect includes a system for performing speaker verification including: means for performing a speech conversation with a first user and means for receiving a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation; means for storing a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users; means for storing the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words; means for performing speech recognition of the first user audio provided during the speech conversation, where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements; means for receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements; means for comparing the plurality of corresponding text elements with each active word of the group of active words; means for identifying text elements matching each active word of the group of active words; means for generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word; means for comparing each captured voiceprint with each first user voiceprint corresponding to each active word of the group of active words; means for generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and means for indicating that the first user identifier is verified if the similarity score is greater than or equal to a threshold value, or that the first user identifier is not verified if the similarity score is less than the threshold value.
  • Another exemplary aspect includes a method for verifying a speaker using natural speech including: initiating a session with one of a system or a computing device or a computer apparatus or a telephone; capturing session related data; pre-identifying the user using the session related data; capturing the user’s live speech; identifying at least one frequent word in the user’s speech; calculating the user’s live voiceprint; retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data; comparing the live voiceprint with the at least one retrieved voiceprint; and verifying the user.
  • a further exemplary aspect includes a system for verifying a speaker using natural speech the system including: means for initiating a session with one of a system or a computing device or a computer apparatus or a telephone; means for capturing session related data; means for pre-identifying the user using the session related data; means for capturing the user’s live speech; means for identifying at least one frequent word in the user’s speech; means for calculating the user’s live voiceprint; means for retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data; means for comparing the live voiceprint with the at least one retrieved voiceprint; and means for verifying the user.
  • Yet another exemplary aspect includes a non-transitory computer program product that causes a system to verify a speaker using natural speech, the non-transitory computer program product having instructions to: initiate a session with one of a system or a computing device or a computer apparatus or a telephone; capture session related data; pre-identify the user using the session related data; capture the user's live speech; identify at least one frequent word in the user's speech; calculate the user's live voiceprint; retrieve at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user's pre-identification data; compare the live voiceprint with the at least one retrieved voiceprint; and verify the user.
  • the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored in a non-transitory manner on, or transmitted as one or more instructions or code over, a computer readable medium.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage media may be any available media that can be accessed by a computer.
  • such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer or any other device or apparatus operating as a computer.
  • Such storage mediums and/or databases on them may be referred to as datastores.
  • any connection is properly termed a computer-readable medium.
  • Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
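The example datastores of FIGs. 4C-4E can be pictured with the following minimal Python sketch; the variable names, embedding dimension, and placeholder values are illustrative assumptions and not part of the disclosure.

```python
import numpy as np

# Database 400 (FIG. 4C): users mapped to user identifiers.
users_to_ids = {
    "User A": ["ID_A"],   # a user may have multiple identifiers
    "User B": ["ID_B"],
    "User C": ["ID_C"],
}

# Database 440 (FIG. 4D): interaction contexts mapped to groups of Active Words.
context_to_active_words = {
    "Interaction Context A": ["account", "card", "one"],  # e.g. a banking application
    "Interaction Context C": ["car", "accident"],
}

# Database 460 (FIG. 4E): stored voiceprints per (user, active word) for a context,
# modeled here as fixed-length embedding vectors (placeholder random values).
stored_voiceprints = {
    ("User A", "account"): np.random.rand(200),
    ("User A", "card"): np.random.rand(200),
    ("User A", "one"): np.random.rand(200),
}

def relevant_voiceprints(user, context):
    """Return only the stored Active Word voiceprints needed for this user and context."""
    words = context_to_active_words.get(context, [])
    return {w: stored_voiceprints[(user, w)]
            for w in words if (user, w) in stored_voiceprints}

print(sorted(relevant_voiceprints("User A", "Interaction Context A")))
```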

Abstract

A system includes an interactive voice recognition (IVR) module arranged to perform a speech conversation with a first user and receive a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation. The system includes a datastore arranged to store a group of active words associated with the interaction context and store first user voiceprints derived from pre-captured audio of the first user, where each active word is selected based on one or more selection criterion derived from conversations of a population of users. An automated speech recognition (ASR) module is arranged to perform speech recognition of the first user audio provided during the speech conversation. A voice biometric (VB) module is arranged to generate captured voiceprints and determine a similarity score based on comparisons of captured voiceprints with first user voiceprints.

Description

SYSTEMS AND METHODS FOR SPEAKER VERIFICATION
FIELD
[001] The present application relates to systems, devices, apparatuses and methods of using voice biometrics. More particularly, the application relates to voice biometrics applied to live natural speech and employing context aware models to verify users.
BACKGROUND
[002] Computing systems, as well as, devices, apparatuses and services often have a controlled access mechanism for regulating user access to their resources and data. These mechanisms may be implemented in hardware or software or a combination of the two. The most commonly used controlled access mechanism is the use of credentials, often in the form of a username and password duet. Depending on the chosen level of security, more complex mechanisms may be employed where the user must be in possession of a physical token (e.g. token generation device for e-banking code entry), key, or card (e.g. a bank card for use at Automated Teller Machine (ATMs), etc.
[003] More recently, biometric methods have been tested and employed in Voice
Biometric (VB) systems, in fingerprint reading, iris scanning, voice identification, face identification, etc. Their adoption and performance have various degrees of success. Some issues related to their success is the varying level of correct user identification as a result of environmental conditions (e.g. poor lighting, noise, multiple concurrent speakers), robustness to users pretending to be someone else or using duplicates of the real user’s characteristics to fool the identification/verification system (e.g. playback of the target user’s voice, showing a photo of the target user’s face or fingerprint), simplicity of use, and necessity for special User Interface (UI) hardware.
[004] For simplicity reasons and for avoidance of requiring special user devices, voice based user verification has been gaining acceptance as either a standalone verification mechanism or in combination with other mechanisms. Existing systems often perform either speech recognition or voice recognition, but not both processes at the same time. Speech recognition identifies particular words that are spoken, while voice recognition identifies the user or speaker that is speaking.
[005] In particular, voice biometrics (i.e. processing of certain voice characteristics) are used to compare a user's voice characteristics against stored voice characteristics of the same and other users. This mechanism can operate as "user identification", where the system identifies a user based on his voice, without previously knowing who the user is or without the user claiming an identity, and as "user verification", where the claimed or estimated identity of the user is known and the system verifies whether the claimed or estimated identity is correct or not.
[006] Several methods and systems exist for voice biometrics for speaker verification and/or identification. They frequently use special text or phrases that the user dictates during an enrolment phase. Alternatively there are examples of conversational scenarios where the users should say a password and (upon correct identification of the said password) they are then asked to provide an answer to a challenge question from a set of questions previously stored by the system and which may relate to the specific user.
[007] Other variations exist, where the user is asked to answer questions, or speak out identification codes for transactions with or without passwords, or simply read aloud a text supplied by the system, known as text-prompted speaker verification, etc.
[008] Some of these systems maintain voiceprint databases and when the speaker utters any phrase in any of the above scenarios, a "live" voiceprint, or more than one, are created and compared against the corresponding voiceprints stored in the database. Following correct verification (and in some cases identification) of the user from his voiceprint, an action is taken by the system, usually granting access to the said user to a system or service.
[009] Variations of the voiceprint matching mechanisms include adaptation of relevant thresholds, statistical processing, and use of multistage matching with an initial match providing a group of potential users from those whose data are stored in the database, and subsequent stage(s) refining the matching to get the most probable match.
[0010] In some of the found references, the inventors use Automatic Speech
Recognition (ASR) and Natural Language Understanding (NLU) modules. An NLU module is used to understand what the user says by identifying words in the user's speech, which words correspond to passwords etc. or answers to system's question, which the user has to provide and the system has to capture from the user's speech and compare them against the corresponding data stored in the user's records in one or more databases. In other words, this is a simple example of a user uttering a credential such as a PIN number (“does the spoken password match the stored password”?).
[0011] Similar scenarios exist, where the user is asked to speak a word or text selected by the system and which was also used at an enrolment step, meaning that the user has to undergo a tedious and often annoying enrolment process, where they are asked to utter words and more often entire phrases upon which their voice identity will be scored, necessitating of course the repetition of the enrolled words or phrases each time they need to be identified or verified by the system.
[0012] Current voice biometric verification methods and systems have drawbacks, as previously explained, which hinder their accuracy and adoption. All the above methods also introduce significant delays until a user is recognized by their voice, as they have to go through a tedious question and answer dialogue with the system. More recent text-independent speaker recognition approaches have "shrunk" such dialogues, while the latest approaches may, under certain limitations and conditions, identify a user only by his request, avoiding in many cases (not always) the use of dialogue sessions. However, even the most advanced prior art does require long segments of user speech to accurately verify the user, in the order of a few seconds. Again, this is not realistic in most real usage scenarios, where the user's spoken request may be extremely short.
[0013] Accordingly, there is a need for innovative methods and systems for user identification and/or verification using voice biometrics that are simple to use, fast enough for real-time operation, do not need specialized hardware, and produce accurate results with minimal or, ideally, no conscious user involvement in enrollment and training steps. Such innovation should also not alter the usual user routine when verbally interacting with a system and be able to operate under little or no restrictions regarding the available user speech.
SUMMARY
[0014] An innovative solution is presented for speaker verification using voice biometrics on live natural speech or dialogues of arbitrary duration and content. The proposed speaker biometric verification is based on the notion of Active Words (AW) where the speaker’s live speech is analyzed and compared against stored voiceprints. To facilitate the verification process in the presence of very short voice segments, the present innovative solution analyzes the uttered speech to identify words, classified as“active words”, that is, words that are frequently used under a specific context (e.g. a financial transaction). Using these Active Words the present solution searches for and compares them with stored active word voiceprints for the same speaker, previously derived during past interactions of the user with the system, and created from a similar model from the previous interactions of sets of users with the system. As a result, the use of Active Words solves the problem encountered in voice biometrics where very short speech segments are not enough to provide an accurate user identification and where the user’s interaction routine with the system has to be changed.
[0015] In an exemplary aspect of the proposed innovative solution, a speaker initiates a communication session with the VB system, which transcribes the speaker’s voice to text and identifies uttered words.
[0016] The system captures and uses session related data to pre-identify the user and then limit the amount of calculations for the similarity scores between the Active Words live voiceprints and only a subset of the archived voiceprints, which belong to the same user. As a result significant improvements are achieved in processing speed, processing capability of simultaneous user sessions in the same processing infrastructure and reduction of the minimum duration of user utterance that can be used for meaningful and accurate user identification and verification.
[0017] The system identifies the current context, where the content and/or context is derived from transcribed speech analysis and/or identification data relating to the specific system application and user session. Transcribed words are analyzed to select frequently used words (AW) and to produce a speaker live voiceprint for the AW and compared to one or more stored voiceprints for the same AW of the same speaker.
[0018] The stored voiceprint(s) has been constructed during enrolment and may have been updated during previous user sessions. The stored voiceprint(s) may be stored in a compact way by exploiting correlations inside the voiceprint and with AW models representing average representations of uttered active words (and other words) of a set of system speakers-users. The system then creates similarity scores with the speaker’s archived voiceprints for the identified active words.
[0019] In another variation of this exemplary aspect, a text-independent speaker identification or verification model is also included, which extracts one voiceprint per utterance. The similarity score provided by the text-independent model is combined with the similarity scores provided by the AW-based model using score-level fusion techniques.
[0020] The present innovative solution and all its exemplary aspects allow the speaker to speak in natural language, without the need to change his pace, complexity, intonation, or any other characteristic of his speech. They also eliminate the need (as in prior art) for the user to enroll to the system the speaker’s uttering of enrolment words and phrases or repeating predefined words and phrases every time they are to be identified. Also, the present innovative solution eliminates the need to use long uttered phrases thereby avoiding altering user interaction pattern and unnecessary dialogue turns with the system to extend the user’s utterances and allow the system to accurately verify them. As a result, simple, faster and more accurate voice biometric verification is achieved.
[0021] One aspect of the invention includes a system for performing speaker verification. The system includes an interactive voice recognition (IVR) module arranged to perform a speech conversation with a first user and receive a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation. The system includes a datastore arranged to store a group of active words associated with the interaction context where each active word is selected based on one or more selection criterion derived from conversations of a population of users. The datastore also stores the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user where each first user voiceprint corresponds to each active word of the group of active words. The system also includes an automated speech recognition (ASR) module arranged to perform speech recognition of the first user audio provided during the speech conversation, where the ASR module converts the first user audio including a plurality of captured audio elements into transcribed text including a
corresponding plurality of text elements.
[0022] The system further includes a voice biometric (VB) module arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, and v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
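As a non-authoritative sketch of steps i)-viii), the following Python function assumes voiceprints are fixed-length embedding vectors, uses cosine similarity as the comparison, takes the highest per-word score as the similarity score, and uses an illustrative threshold; none of these specific choices are mandated by the disclosure.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def vb_verify(captured_audio_elements, text_elements, active_words,
              first_user_voiceprints, embed, threshold=0.7):
    """captured_audio_elements[i] is the audio for text_elements[i];
    first_user_voiceprints maps each active word to the stored voiceprint of the
    claimed user; embed() turns a captured audio element into a voiceprint vector."""
    scores = []
    for audio, text in zip(captured_audio_elements, text_elements):
        if text.lower() not in active_words:      # iii)-iv) keep only matching text elements
            continue
        captured_vp = embed(audio)                # v) captured voiceprint for the match
        stored_vp = first_user_voiceprints[text.lower()]
        scores.append(cosine(captured_vp, stored_vp))   # vi) per-active-word comparison
    if not scores:
        return False                              # no active word detected in the utterance
    similarity = max(scores)                      # vii) one possible similarity score
    return similarity >= threshold                # viii) threshold decision
```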
[0023] In some implementations, the similarity score is based on the closest comparison of one of the captured voiceprints with one of the first user voiceprints. An audio element may include at least one of a phoneme, syllable, word, subword, or phrase. A text element may include at least one of a word, subword, and phrase. In one configuration, portions of operations of the IVR, ASR, and VB are implemented in a common module.
[0024] Speech recognition may be implemented using at least one speech recognition model including vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and/or end to end automatic speech recognition. The one or more of the comparisons of each captured voiceprint with each first user voiceprint may include at least one voice recognition model selected from the group of vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back off model, Neural networks, Deep feedforward and recurrent neural networks, Long short term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and end to end automatic speech recognition.
[0025] In one implementation, the one or more selection criterion for active words includes at least one selected from the group of frequency of use, type of word, amount of phonemes in a word, combination of phonemes, and amount of syllables in a word or phrase. The interaction context may include a type of interaction that the first user has with the IVR. The interaction context may include a banking application, a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR-based application that enables users to interface with a service or product provider.
[0026] In another aspect, a server configured to perform speaker verification includes a communications interface arranged to perform a speech conversation with a first user and receive a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation. The communications interface may be arranged to receive, from a datastore, a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users. The communications interface may receive, from the datastore, the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words. The server also includes a processor arranged to perform speech recognition of the first user audio provided during the speech conversation, where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements. The processor is also arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, and v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
[0027] A further aspect includes a method for performing speaker verification including performing a speech conversation with a first user and receiving a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation. The method includes receiving a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users and receiving the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words. The method further includes: performing speech recognition of the first user audio provided during the speech conversation, where a processor converts the first user audio including a plurality of captured audio elements into transcribed text including a
corresponding plurality of text elements; receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements; comparing the plurality of corresponding text elements with each active word of the group of active words; identifying text elements matching each active word of the group of active words; generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word; comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words; generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and if the similarity score is greater than or equal to a threshold value, indicating that the first user identifier is verified or if the similarity score is less than the threshold value, indicating that the first user identifier is not verified.
[0028] With respect to patent eligibility, the above aspects should not be considered directed to an abstract idea. Instead, the above aspects should be considered directed to an Internet-centric problem or improvement of computer technology related to more efficient voice or speaker recognition that advantageously reduces memory and processing demands on a voice biometric system. By using active words to perform voice recognition, a voice biometric system is able to more efficiently focus on a subset of user audio or speech. The system also advantageously combines speech recognition with voice recognition in a technically innovative way to enable rapid identification of the active words using speech recognition, to which voice recognition is then applied. While the above aspects could involve an abstract idea, the aspects are not directed to that idea standing alone. A long-standing problem with voice recognition is how to quickly, efficiently, and reliably verify the identity of speakers. The above aspects are directed to technically improving the speed, efficiency, and reliability, while reducing the cost in processing and memory of speaker recognition.
[0029] Even if additional features of the above aspects, when viewed individually, are considered generic computer and networking functions, an inventive concept exists because of the unconventional and non-generic combination of known elements, including using speech recognition to identify active words and then using voice biometrics or speaker recognition on those identified active words, which enables an efficient and reliable verification of the identity of a user. Furthermore, the various features and limitations of the above aspects should confine any abstract ideas to a particular and practical application of those abstract ideas such that the combination of features is not a well-understood, routine or conventional activity. The above comments should apply to other aspects described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] FIG.1 shows a schematic diagram of a system implementing aspects of the present inventions.
[0031] FIG.2 shows basic speaker verification using natural speech as performed in prior art.
[0032] FIG.3 shows a schematic diagram of the components and the flow of data in an innovative exemplary implementation of a voice biometric system using Active Words.
[0033] FIG.4A shows a first example of a database storing speaker voiceprint and related information.
[0034] FIG.4B shows an example of a database storing Active Words speaker voiceprint and related information.
[0035] FIG. 4C illustrates a database associating users with user identifiers.
[0036] FIG. 4D illustrates a database associating active words with various user population contexts.
[0037] FIG. 4E illustrates a database that associates voiceprints of active words by various users for a particular interaction context associated with an IVR system.
[0038] FIG.5 shows the basic hardware architecture of an application server.
[0039] FIG.6 shows the basic software components running on an application server.
DETAILED DESCRIPTION
[0040] The word "exemplary" is used herein to mean "serving as an example, instance, or illustration". Any aspect, implementation, or configuration described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects, implementations, or configurations.
[0041] The acronym "IVR" is intended to mean "Interactive Voice Response".
[0042] The acronym“NLU” is intended to mean“Natural Language Understanding”.
[0043] The acronym“ASR” is intended to mean“Automatic Speech Recognition”.
[0044] The acronym "DM" is intended to mean "Dialogue Manager".
[0045] The acronym "PSTN" is intended to mean "Public Switched Telephone Network".
[0046] The acronym "PLMN" is intended to mean "Public Land Mobile Network".
[0047] The acronym“VAD” is intended to mean“Voice Activity Detector”.
[0048] The acronym“VA” is intended to mean“Voiceprint Adaptation”.
[0049] The acronym“UI” is intended to mean“User Interface”.
[0050] The acronym“OS” is intended to mean“Operating System”.
[0051] The term“mobile device” may be used interchangeably with“client device” and“device with wireless capabilities”. A mobile device may include, without limitation, a cellular telephone, a mobile handset, a personal digital assistant, a wireless tablet, a wireless computing device, and the like.
[0052] The term“user” may be used interchangeably with“regular user” and
"ordinary user" and "speaker". It may also be used to mean "caller" in a telephone or VOIP call or conferencing session, "user of an application" or "user of a service", and "participant" in a text chat, audio chat, video chat, email, audio-conference or video-conference session.
[0053] The term "system" may be used interchangeably with "device", "apparatus", and "service", except where it is obvious to a reader of ordinary skill in related art that these terms refer to different things, as this is apparent by the context of the discussion in which they appear. Under any circumstance, and unless otherwise explicitly or implicitly stated in the description, these four terms should be considered to have the broadest meaning, i.e. that of encompassing all four.
[0054] The present invention treats the problem of identifying a speaker using voice biometrics applied to natural language, free speech dialogue of the speaker with a system, device, apparatus, or service. It aims to present a solution to many speaker verification problems, comprising avoiding having the speaker to dictate predefined text extracts, repeat text prompts, alter his intonation, use simplified speech, emphasize predefined keywords, use non-verbal means of identification, or add unnecessary complexity in the identification process. It also targets the use of fully automated systems, apparatuses and devices which support natural language processing and understanding, compatibility with existing
Interactive Voice Response (IVR) systems, support of natural, real-time Dialogue Manager (DM) systems, and Automatic Speech Recognition (ASR) systems.
[0055] FIG.1 shows a schematic diagram of a system implementing aspects of the present inventions. A speaker can use the identification service via any type of device, comprising analogue telephone 10, digital telephone or mobile (cellular) telephone or smartphone 20, or computer or laptop or tablet 30 with voice capture and audio playback capabilities. Regardless of the device or apparatus of choice, the speaker's speech is converted into either a voice signal (if an analogue telephone 10 is used) or a data stream and sent to a network 40, i.e. a Public Switched Telephone Network (PSTN) or one of a Public Land Mobile Network (PLMN), the Internet, a Private Network, or a Cloud, respectively. The voice signal is converted to data (not shown in FIG.1) by a server and then fed to the back-end portion of the system, through an optional firewall 50 for enhanced security of personal and corporate data. Assuming the data traffic passes the firewall, an optional Cache and/or Proxy Server 60 handles them. The inclusion of such a server or servers is optional and can serve to further enhance data security in a mode analogous to secure browsing, by isolating an Application Server 70 and the data stored in a Database 80 so that a potential unauthorized intruder can access and modify only a copy of these data and not the original sensitive data. It can also serve as a faster means of accessing commonly used data that may be dispersed to a number of servers and databases as opposed to a single server 70 and database 80.
[0056] It is obvious to a reader of ordinary skill in related art that additional devices, apparatuses, subsystems, servers, firewalls, databases and other computing components may be added, or that some of the elements of FIG.1 may be omitted (e.g. firewall 50, Cache and/or Proxy Server 60), or that their position and connections to each other may be modified without altering the scope, usability, essence, operation, result and purpose of the present invention.
Speaker VB Verification
[0057] In an exemplary aspect, a speaker is verified based on his natural speech.
FIG.2 shows basic speaker verification using natural speech as performed in prior art. The method starts with the speaker initiating a communication session 100 with the exemplary system of FIG.l (or a VB server). The user may initiate a telephone session using an analogue telephone 10, a digital (Voice Over Internet Protocol (VOIP)) phone or mobile phone or smartphone 20 or tablet, laptop, or desktop computer 30. The telephone session may also be implemented as a teleconferencing - multi-party session. Alternatively, the speaker may use a software application running on his smartphone 20 or tablet, laptop, or desktop computer 30. This software application may provide VOIP telephone or teleconferencing - multi-party session, video call, video conferencing, text or voice or video chat session, e-mail session, or a combination of the preceding.
[0058] It is noted that the devices 10-30, apparatuses, subsystems and applications perform no processing other than forwarding voice signals or first digitizing voice signals, converting them into data and then forwarding them via any of the networks 40 shown in FIG.1 to back-end servers 60, 70, and receiving voice signals or data representing voice signals for playback to the speaker.
[0059] Following the initiation of the communication session 100, the device chosen by the speaker captures his live voice 105. The back-end server or servers 60, 70 receive the voice data corresponding to the user's live natural speech and calculate his live voiceprint 130. A voiceprint is a set of measurable characteristics of a human voice that can be used to uniquely identify the speaker. These characteristics, which are based on the physical configuration of the speaker's mouth and throat, can be expressed as a mathematical formula or as a vector. The voiceprint can be calculated with any algorithm and contain any number of features. By means of example, the voiceprint can be calculated at a backend server by creating an i-vector (or an x-vector or other representation), i.e. a projection of a variable duration speech utterance onto a single low dimensional space, typically of a few hundred components characterizing the respective utterance. Further details regarding the use of i-vectors and/or x-vectors are described in Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, Kenny et al., 19(4), 788-798 (2011) and X-vectors: Robust DNN embeddings for speaker recognition, 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5329-5333) (April 2018), the entire contents of both of which are incorporated herein by reference.
[0060] This is followed by fetching all voiceprints corresponding to the same speaker stored 140 at the backend servers 60, 70 or at a backend database 80. The voiceprints associated with this speaker are selected by using call IDentification (ID) and other similar information captured by server 70 (and/or server 60) and a lookup table. As a result, the search space of stored speaker voiceprints to compare with the speaker's live voiceprint is reduced. Alongside the stored voiceprints are stored the corresponding digitized voice files (e.g. WAV, MP4, MP3, or any other sound format), user ID data, and voice metadata describing the content and context of the stored digitized voice files.
[0061] A first similarity score is calculated 150 for each pair of the speaker's live voiceprint and each fetched voiceprint, and the pair with the largest similarity score is selected 155 along with the corresponding similarity score. This similarity score is then compared to a predefined threshold T1 160. If this score is greater than or equal to T1, then the speaker is correctly identified 170. If not, then the speaker identification has failed and the system may, optionally, prompt the speaker to repeat 180 what he just uttered or may start a dialogue to request clarifications or additional information that can be used to correctly verify the speaker against the partially-identified speaker from the call ID and other available data.
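A minimal sketch of steps 150-170 above, assuming each voiceprint is an embedding vector (e.g. an i-vector or x-vector) and using cosine similarity; the threshold value T1 is an illustrative assumption.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(live_voiceprint, fetched_voiceprints, t1=0.7):
    """fetched_voiceprints: voiceprints stored for the pre-identified speaker (step 140)."""
    # Step 150: one similarity score per pair of live voiceprint and fetched voiceprint.
    scores = [cosine(live_voiceprint, vp) for vp in fetched_voiceprints]
    if not scores:
        return False          # no stored voiceprint for this speaker
    best = max(scores)        # Step 155: keep the pair with the largest similarity score.
    return best >= t1         # Steps 160-170: verified only if the score reaches T1.
```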
[0062] Failure of voice recognition and/or speaker verification may be due to lack of a stored voiceprint corresponding to this particular speaker, or due to noise in the voice signal, changes in the speaker's voice as a result of e.g. a sore throat, aging if the speaker has not used the user identification service for a long time, or other reasons beyond the scope of this invention. In the former case, the system cannot identify the speaker unless the speaker provides additional information beyond that used for voice biometrics. In the latter case, the repetition of the uttered speech by the speaker, or his participation in a natural speech dialogue with the system could lead to a correct identification result.
[0063] This method produces acceptable results assuming the speaker's voiceprint is among those stored at the backend servers 60, 70 and/or the database 80 and the live speaker's uttered speech has a sufficiently long duration (typically several seconds) for a useful live voiceprint to be constructed.
[0064] During the initial setup of the system, the database is populated with voiceprints corresponding to speakers of the system. This database population process may be repeated during the user's new interactions with the system or for new speakers-users.
Manual intervention of an operator or support staff may optionally be used.
[0065] Real time operation and response times depend on the processing power of the application server 70 undertaking the processing, which in turn depends on the amount of stored voiceprints. For general purpose systems where the number of potential speakers (and consequently the corresponding stored voiceprints) is large, in the order of hundreds of thousands or millions, the delays introduced for processing all these data and verifying the speaker can be beyond what users may consider acceptable. For this reason, reducing the set of stored voiceprints used in the VB calculations, against which the live speaker's voiceprint is compared, is essential.
[0066] Nevertheless, the reduced stored voiceprint set may still be comprised of significantly different phonemic content as compared to that of the live voiceprint, making an effective comparison and subsequent scoring sub-optimal. To reduce the variability of phonemic content mismatch between the stored voiceprint and the live voiceprint, systems have required that the live voiceprint data is comprised of at least a minimum duration of speech, which ensures that the phonemic content of the live voiceprint will be enriched enough so as to statistically be not significantly different from the stored voiceprint phonemic content.
[0067] This requirement for a significantly long live speech sample so as to have a statistically similar phonemic distribution between the live and stored voiceprints, poses a significant usability and applicability challenge to real world applications. The required duration of live speech samples is often of the order of several words, or of particularly long utterances, which would not normally be found in an interaction between an application user and a conversational system. Thus, the applicability and usability of these systems in conversational interaction applications is severely limited.
Speaker VB verification using Active Words
[0068] FIG.3 shows a schematic diagram of the components and the flow of data in an innovative exemplary implementation of a voice biometric system using AW. The innovative Voice biometric system 1100 comprises an Interactive Voice Response (IVR) module, responsible for performing live speech conversation with a live speaker 1101, based on uttered questions and replies. The speech communication between speaker 1101 and IVR 1105 is done according to what is described in FIG.1. IVR 1105 is connected to an Automatic Speech Recognition (ASR) module 1120, which ASR 1120 analyzes uttered speech and transcribes the speech into text. IVR 1105 streams voice received from speaker 1101 to ASR 1120 together with identification data. Identification data may include, without limitation, a unique subscriber identifier such as an International mobile subscriber identity (IMSI), a mobile identification number (MIN), a mobile subscription identification number (MSID or MSIN), temporary mobile subscriber identity (TMSI), Mobile Subscriber ISDN Number (MSISDN), Mobile Station International ISDN Number, Mobile International ISDN Number, Subscriber Number (SN), and/or a Packet temporary mobile subscriber identity (P-TMSI), a unique electronic serial number (ESN), a mobile device identifier, a mobile equipment identifier (MEID), an International Mobile Equipment Identifier (IMEI), a media access control (MAC) address, Android ID, a Unique Device Identifier (UDID), Universally Unique Identifier (UUID), a Core Foundation Universally Unique Identifier (CFUUID), a globally unique identifier (GUID), an OpenUDID, a SecureUDID, a unique Identifier (UIDevice), LTE static IP address (UEIP), Tracking Area Identity (TAI), Temporary Identity Number (TIN), Globally Unique Temporary UE Identity (GUTI), advertising ID - identifierForAdvertising (IDFA), identifierForVendor (IDFV), an Android Advertising ID, Google Advertising ID, Windows Advertising ID, Apple ID (i.e., iOS App IDs), Packet Data Network Identity (PDN ID), P-GW ID, Evolved Packet System Bearer Identifier (EPS Bearer ID), E-UTRAN Radio Access Bearer Identifier (E-RAB ID), Data Radio Bearer Identifier (DRB ID), Public Land Mobile Network Identifier (PLMN ID), MME Identifier (MMEI), Globally Unique MME Identity (GUMMEI), Tracking Area Identity (TAI), Temporary Identity (TIN), SAE Temporary Mobile Subscriber Identity (S-TMSI), MME temporary mobile subscriber identity (M-TMSI), and a tunnel end point identifier (TEID), C-RNTI,
User Equipment Identifier (UEID), an email address, an IP address, a Mobile IP address, a username, caller ID, session ID, an RFID, and any other device identifier. ASR 1120 receives the streamed speech and identification data and uses an acoustic model and optionally a language model (both stored locally at ASR 1120, or at a local database, or at the cloud) to identify phonemes, syllables, words and sentences in the speaker's 1101 speech.
[0069] The acoustic and language models used by ASR 1120 may be produced by the proposed innovative system. In some implementations, either one or both of the two models may be imported from other applications and systems, external to system 1100, where their creation involves analyzing speech files and their transcribed text content and timing information, and words, phrases, and their relationships in a given language, respectively.
The models may be automatically created by system 1100 or external systems. In yet another implementation, manual intervention may be employed during the creation of the models to determine a configuration of and/or selection of parameters for a model. Speech recognition may include automatic speech recognition (ASR), computer speech recognition, and/or speech to text (STT). Speech recognition and/or voice recognition models herein may include, without limitation Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and/or end to end automatic speech recognition.
[0070] The transcribed speech may optionally be fed to a Natural Language
Understanding (NLU) unit. In one implementation, the NLU uses a Semantic Model of the used language and creates metadata that describe what the words in the transcribed text mean. For instance, the phrase "how much money did I spend at Starbucks last month" is processed by the NLU, which assigns the tag "merchant" to "starbucks" and the tag "date" to "last month". The semantic model is independent of and unrelated to the current invention and its aspects and can be imported from other systems or services.
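For illustration only, the NLU output for the example phrase above might be represented as a simple tagged structure; the schema and the intent label are assumptions, not part of the disclosure.

```python
# Illustrative NLU result for the example utterance; only the "merchant" and "date"
# tags come from the description, the rest of the schema is assumed.
nlu_result = {
    "utterance": "how much money did I spend at starbucks last month",
    "intent": "spending_query",     # assumed intent label
    "entities": [
        {"tag": "merchant", "value": "starbucks"},
        {"tag": "date", "value": "last month"},
    ],
}
```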
[0071] ASR 1120 continues by streaming speech, transcribed speech (i.e. text) and control data (e.g. word and/or letter and/or sentence boundaries and tags identifying word and/or letter and/or sentence boundaries in the streamed speech) to Voice Biometric module 1130, so as to allow association of the streamed (i.e. uttered) speech with its content (i.e. transcribed text). In some implementations, the alignment of words may be done by the ASR module 1120 or the VB module 1130.
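The speech, transcribed text, and control data streamed from ASR 1120 to VB 1130 can be pictured with the following illustrative structure; the field names, timings, and helper function are assumptions made for this sketch.

```python
# Illustrative payload from ASR 1120 to VB 1130: transcript plus word-boundary
# control data so that each word can be aligned with (and cut out of) the audio.
asr_to_vb_payload = {
    "session_id": "example-session-001",          # assumed identifier
    "transcript": "I want my bank account balance please",
    "word_boundaries": [                           # (word, start_sec, end_sec) -- assumed values
        ("I", 0.00, 0.12), ("want", 0.12, 0.38), ("my", 0.38, 0.52),
        ("bank", 0.52, 0.85), ("account", 0.85, 1.40),
        ("balance", 1.40, 1.95), ("please", 1.95, 2.40),
    ],
}

def slice_word_audio(audio_samples, word_boundaries, word, sample_rate=16000):
    """Return the audio samples for the first occurrence of `word`, if any."""
    for w, start, end in word_boundaries:
        if w.lower() == word.lower():
            return audio_samples[int(start * sample_rate):int(end * sample_rate)]
    return None
```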
[0072] VB 1130 receives the speech, text and control data and uses them to identify speaker 1101. To reach an identification result, VB 1130 analyzes the transcribed text and control data to deduce the content and/or context of speech (e.g. a request to get the balance of a bank account). In some configurations, VB 1130 communicates with an NLU module (not shown in FIG.3) to receive an estimation of the context of the speech, as this context is derived from the content of the transcribed speech by analyzing the natural language using a language model and rules to understand the conveyed meaning.
[0073] Using the content and/or context of the speech, VB 1130 checks a lookup table
(or other data representation either locally or remotely stored) which associates a context (e.g. "I want my bank account balance please") with a set of used words (e.g. "account" and "balance") based on one or more selection criterion. In one implementation, a selection criterion may be frequency of use. Other selection criteria may be used such as, without limitation, type of word, amount of phonemes in a word, combination of phonemes, and/or amount of syllables in a word or phrase. A selected word is referred to as an "Active Word" (AW). An AW is used by VB 1130 to improve the performance of speaker 1101 verification. This technical improvement stems from the fact that VB 1130 uses AWs to limit the search of speaker voiceprints stored in a database 1140, accessed by VB 1130, to only those speaker voiceprints that correspond to AWs, and thus improve the phonemic content match between the live speaker voiceprint and the stored speaker voiceprints. This process of matching the phonemic content of live and stored speaker voiceprints, using active words, alleviates the requirement for long speaker samples and long utterances, and allows the VB 1130 to perform an accurate voiceprint scoring with minimal live speech samples.
[0074] This search limiting action also improves database access times as hundreds or thousands of voiceprints may be stored for the same speaker, and more importantly significantly reduces the processing needs to compare the live speaker's speech voiceprint (computed by VB 1130 in real time) with the voiceprints in database 1140. As a result, the number of comparisons is reduced from a scale of hundreds or thousands of comparisons to merely a few dozen or even fewer comparisons.
[0075] These voiceprints may, in some implementations, be stored together with metadata describing their content and/or context. These metadata may be created using speech recognition and natural language understanding analysis, as well as speaker intent analysis. They may also be captured by analyzing speaker interaction with the system, or use of an application running at a user or speaker's computing device 20 or 30, which may provide a multimodal (voice, text, graphics, video, chat, email) user interface, while enabling all voice processing at the application server 70.
[0076] Each of the stored voiceprints may correspond to a word, phrase, or paragraphs uttered by a user at an earlier session with the system. For instance,“my name is John” creates a different voiceprint from“thank you”. This is not only due to the different words in these two phrases but also due to the different phonemes, and their combinations, leading to different intonations and linking between them, and so on. These characteristics may have an effect on the number of correct user identification results and failures, especially in the presence of noise. For instance, if the live user’s voiceprint has a different context (e.g. is derived from different words or phonemes) than the stored voiceprints selected according to the pre-identification result then the live and the stored voiceprints may not be similar enough to correctly identify the user and lead to a failure or false negative result.
[0077] The choice of the amount of AWs can be based on their frequency of occurrence and on AW models generated by VB 1130 and stored in an AW database 1160. These AW models may be created as an average model from the recorded speech of all or a subset of the speakers of system 1100, i.e., a user or speaker population. Example active words may include“balance”,“account” and“please” 1160. In this example, two active words are defined for the selected context. VB 1130 could have selected more AWs or even a single AW. The choice of the number of AWs may be based on performance metrics (e.g. accuracy of result, time to compute, etc.) or a selection by the system administrator and may be structured to reflect first the frequency distribution of the words in a collection of different speakers’ speech, or of the same speaker 1101 in past uses of the system for the same context. Second, to reflect the possibility that in the presence of noise, or strange accent and intonation of the speaker 1101 (e.g. when the speaker has a sore throat) some of the uttered words may be erroneously transcribed, so using a larger set of AWs will contain at least one of the words uttered by the speaker and correctly transcribed by ASR 1120. A balance between the size of the set of AWs and VB 1130 performance may be calculated in real time by VB 1130, which may periodically adjust the size of the AW set for speaker 1101 and for the specific context in consideration. In some configurations, a system administrator may define the size of the AW sets.
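A hedged sketch of how frequently occurring words in a population's transcripts for one interaction context might be selected as Active Words; the corpus, the minimum word length filter, and the cut-off of three words are assumptions.

```python
from collections import Counter

def select_active_words(transcripts, num_words=3, min_length=3):
    """transcripts: transcribed utterances gathered from the user population for one
    interaction context. Returns the most frequently used candidate words."""
    counts = Counter()
    for text in transcripts:
        for word in text.lower().split():
            if len(word) >= min_length:     # crude filter for very short tokens
                counts[word] += 1
    return [word for word, _ in counts.most_common(num_words)]

# Illustrative, made-up transcripts from a banking context:
example_transcripts = [
    "i want my bank account balance please",
    "what is the balance of my savings account",
    "please read my account balance",
]
print(select_active_words(example_transcripts))   # ['account', 'balance', 'please']
```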
[0078] VB 1130 compares the live user’s speech voiceprint with the AW speaker’s voiceprint(s) for the detected content and/or context, stored in database 1140, and produces a first similarity score for each AW speaker’s voiceprint. By means of example, the comparison is done in a vector space, where both the live speaker’s speech voiceprint and the selected AW speaker’s voiceprint(s) are i-vectors (or x-vectors, or another representation). Also by means of example, i-vectors (or other representations) may be stored in a compact form where correlations between the coefficients in the i-vector, or between the i-vector and the corresponding AW i-vector (in database 1150), are exploited, so as to significantly reduce the dimensions of the i-vectors that need to be compared, with the aim of reducing processing time during real-time operation.
[0079] VB 1130 uses the first similarity scores of the comparison results and selects the stored speaker voiceprint corresponding to the highest score, which it then compares against threshold T1; if the first similarity score equals or exceeds T1, the speaker is correctly verified.
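A minimal sketch of this scoring step is shown below; cosine similarity over i-vector-like embeddings is used purely as an example metric, and the value of the threshold T1 is an assumption:

```python
# Illustrative sketch: scoring a live voiceprint against stored AW voiceprints
# and applying threshold T1. The metric (cosine similarity) and T1 are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(live_voiceprint, aw_voiceprints, t1=0.75):
    """Keep the best first similarity score over the AW voiceprints and accept
    the speaker if that score equals or exceeds T1."""
    scores = {word: cosine(live_voiceprint, vp) for word, vp in aw_voiceprints.items()}
    best_word, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_score >= t1, best_word, best_score

live = [0.10, -0.30, 0.60]
stored = {"account": [0.11, -0.32, 0.58], "balance": [0.50, 0.20, -0.10]}
print(verify(live, stored))                  # e.g. (True, 'account', 0.99...)
```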
[0080] As a result, the use of AWs enables a word-by-word (or even subword-by-subword) comparison between the live user’s speech and the stored (e.g. in i-vector form) user’s sample for each AW, producing more accurate results at a fraction of the processing time required without AWs, and using live speech segments of very short duration. By means of example, AWs can provide accurate identification results using live speech of only a single word or of word segments (e.g. syllables or phoneme combinations), which is very useful in situations where the speaker is very “stingy” with his speech in conversation with system 1100, or where segments of his uttered speech are corrupted beyond ASR 1120 recognition due to noise or other reasons related to the speaker or to external influences.
[0081] System 1100 is managed by a Dialogue Manager (DM) module 1110 which controls the operation of all other modules 1105, 1120, 1130, 1150. DM 1110 is controlled by an application developer or system administrator and exchanges control data with the other system components. DM 1110 controls the operation of IVR 1105 to start and perform a dialogue with speaker 1101 in order to get his request to be serviced by system 1100 (e.g. to provide the speaker with the balance of his bank account), or to request further information (e.g. which account, credit or savings account), or to prompt the speaker to get more speech input in cases where an accurate speaker identification or verification is not possible using the already uttered speech by the speaker (e.g. very noisy sample, truncated, etc.).
Communication between IVR 1105 and DM 1110 may be done using any publicly available or proprietary protocol.
[0082] DM module 1110 also communicates with VB module 1130 to initiate and control a VB session using the streamed and transcribed speech and identification data received at VB module 1130 from ASR module 1120. By means of example, VB 1130 signals DM 1110 that speaker identification is not possible with high accuracy using the available live speech specimen. DM 1110 may then instruct VB 1130 to store the initial speech, text and identification data and initiate a second round of identification, also involving IVR 1105, to get better scoring that will increase the accuracy of user identification. If identification is achieved, the result is signaled by VB 1130 to DM 1110 and DM 1110 signals all system modules to end the current session. DM 1110 may then output the
identification result to external systems (not shown in FIG. 3).
[0083] The efficient operation of the system and the improvements brought by the use of AWs and pre-identification information allow, in an exemplary scenario of use, verification of a user who calls the system from his fixed line or from his smartphone and utters only the phrase “account balance” if he has a single account with the bank he is calling, or “balance” if he has only one account and no credit or other cards with the same bank. In other words, the proposed innovative method and system are able to verify the user’s identity and then service his request using the pre-identification information, the derived context of the speech session, and a live speech sample of minimal duration (even as low as 1 second or less).
Speaker Enrollment with VB system using AW
[0084] A speaker calls into system 1100, listens to a first speech prompt from IVR 1105, and speaks a first utterance.
[0085] DM 1110 receives identification data (e.g. Automatic Number Identification
Data (ANI), caller ID, or MAC address, or other) from IVR 1105 and partially identifies (pre-identifies) the speaker from the identification information. The speaker cannot be fully identified by the identification information provided by IVR 1105 because a device, calling number, etc. can be associated with or used by more than one speaker (e.g. a phone number is used by a family, or a smart phone of a user may be used by a third person either with the owner’s consent or without it).
[0086] DM 1110 contacts and instructs VB 1130 to perform a lookup for an existing voiceprint referenced by the identification information, i.e. a voiceprint associated with the identification information.
[0087] If the voiceprint does not exist, VB 1130 initializes a new (empty) voiceprint, receives the utterance transcription from ASR 1120, identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to the AW, creates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system’s AW models, and returns back to DM 1110 a control message (e.g. “voiceprint_does_not_exist”). [0088] DM 1110 instructs IVR 1105 to play a next speech prompt to the speaker. The speaker, in response, speaks a second utterance, and his speech is streamed by IVR 1105 to ASR 1120. ASR 1120 then relays the streamed speech to VB 1130. In a variation of the present exemplary implementation, the streamed speech is sent by IVR 1105 to DM 1110, and it is DM 1110 that relays the streamed speech to VB 1130.
[0089] ASR 1120 produces a transcription of the second utterance, which is sent to
VB 1130.
[0090] VB 1130 evaluates the suitability of the speech sample for the creation of the voiceprint.
[0091] If the sample quality is not good (e.g. one or more quality metrics is below a corresponding threshold), the sample is discarded and a “bad_sample” response is returned by VB 1130 to DM 1110.
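A minimal sketch of such a quality gate follows; the specific metrics (duration and signal energy) and their thresholds are assumptions for this example, since the description only requires that one or more quality metrics be compared against corresponding thresholds:

```python
# Illustrative sketch: gating a speech sample on quality before voiceprint work.
# The metrics (duration, RMS energy) and the thresholds are assumptions.

def evaluate_sample(samples, sample_rate=8000, min_seconds=0.5, min_rms=0.01):
    """Return 'good_sample' or 'bad_sample' for a list of audio samples."""
    duration = len(samples) / sample_rate
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5 if samples else 0.0
    status = "bad_sample" if (duration < min_seconds or rms < min_rms) else "good_sample"
    return {"status": status, "duration": duration, "rms": rms}

print(evaluate_sample([0.0] * 4000))         # silent audio -> bad_sample
print(evaluate_sample([0.2, -0.2] * 4000))   # audible, 1 s  -> good_sample
```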
[0092] If the sample quality is good, VB 1130 identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to the AW, updates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system’s AW models, and also stores the AW speech samples (that have been segmented and isolated from the speaker’s speech using alignment of the speech with the transcribed text used to detect the AW) in database 1140.
[0093] The process is repeated until enough samples are stored in database 1140 to create a voiceprint for each AW, and a voiceprint is created for each AW. The collection of enough samples to create the speaker voiceprint for all AWs may be done during a single speaker interaction with IVR 1105, or during several interactions. However, speaker voice biometric verification cannot be performed until at least one speaker voiceprint for an AW has been saved in database 1140. So, until such voiceprint(s) have been stored, all speaker interactions with IVR 1105 need to involve other types of (full, not partial) user identification than VB (e.g. keying in or uttering a password).
[0094] Until a speaker voiceprint is created, his voice samples are stored to database 1140 and a “voiceprint_creation_failure_reason” message (e.g. “voiceprint_creation_failure_reason=not_enough_sample”) is sent from VB 1130 to DM 1110. Upon successful creation of a speaker voiceprint for an AW, the AW voiceprint is stored in database 1140 alongside the speech AW samples (i.e. audio) that were used to create the AW voiceprint, and a “voiceprint_creation_success” message is sent by VB 1130 to DM 1110.
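For illustration only, the enrollment loop described above may be sketched as follows; the required sample count, the placeholder embedding function, and the exact message strings are assumptions where they go beyond the description:

```python
# Illustrative sketch of the enrollment loop: AW occurrences are detected in the
# transcription, their aligned audio segments are accumulated, and a voiceprint
# is created once enough samples exist. MIN_SAMPLES and embed() are assumptions.

MIN_SAMPLES = 3
aw_samples = {}      # (speaker_id, active_word) -> list of audio segments
aw_voiceprints = {}  # (speaker_id, active_word) -> voiceprint

def embed(audio_segments):
    # Placeholder for an i-vector/x-vector extractor averaged over the samples.
    return sum(sum(seg) for seg in audio_segments) / max(len(audio_segments), 1)

def enroll_utterance(speaker_id, transcript_words, aligned_audio, active_words):
    """Store AW audio from one utterance and create voiceprints when possible."""
    messages = []
    for word, segment in zip(transcript_words, aligned_audio):
        if word not in active_words:
            continue
        key = (speaker_id, word)
        aw_samples.setdefault(key, []).append(segment)
        if len(aw_samples[key]) >= MIN_SAMPLES and key not in aw_voiceprints:
            aw_voiceprints[key] = embed(aw_samples[key])
            messages.append(f"voiceprint_creation_success:{word}")
    if not messages:
        messages.append("voiceprint_creation_failure_reason=not_enough_sample")
    return messages
```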
[0095] It is noted that it would be obvious to a person of ordinary skill in the related art that the above steps and messages may be altered, deleted, or substituted with others, and that new messages and steps may be added, without altering the scope and/or context of the proposed innovative solution.
Speaker Verification with VB system using AW
[0096] A speaker calls into system 1100, listens to a first speech prompt from IVR 1105, and speaks a first utterance.
[0097] DM 1110 receives identification data (e.g. Automatic Number Identification
Data (ANI), caller ID, or MAC address, or other) from IVR 1105 and partially identifies the speaker from the identification information. The speaker cannot be fully identified by the identification information provided by IVR 1105 because a device, calling number, etc. can be associated with or used by more than one speaker (e.g. a phone number is used by a family, or a smart phone of a user may be used by a third person either with the owner’s consent or without it).
[0098] DM 1110 contacts and instructs VB 1130 to perform a lookup for an existing voiceprint referenced by the identification information, i.e. a voiceprint associated with the identification information.
[0099] In one implementation, VB 1130 returns back to DM 1110 a
“voiceprint_exists” message and DM 1110 instructs IVR 1105 to play a next speech prompt to the speaker. The speaker, in response, speaks a second utterance, and his speech is streamed by IVR 1105 to ASR 1120. ASR 1120 then relays the streamed speech to VB 1130. In some implementations, the streamed speech is sent by IVR 1105 to DM 1110, and it is DM 1110 that relays the streamed speech to VB 1130.
[00100] ASR 1120 transcribes the utterance and forwards it to VB 1130 (directly or via
DM 1110). VB 1130 evaluates the suitability of the speech sample for the creation of the voiceprint. If the sample quality is not good (e.g. one or more quality metrics is below a corresponding threshold), the sample is discarded and a “bad_sample” response is returned by VB 1130 to DM 1110. [00101] If the sample quality is good, VB 1130 identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to each AW, updates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system’s AW models, and also stores the AW speech samples (that have been segmented and isolated from the user’s speech using alignment of the speech with the transcribed text used to detect the AW) in database 1140.
[00102] The process is repeated until enough samples are stored in database 1140 to create a voiceprint for each AW, and a voiceprint is created for each AW. The collection of enough samples to create the speaker voiceprint for all AWs may be done during a single speaker interaction with IVR 1105, or during several interactions. However, voice biometric verification cannot be performed until at least one speaker voiceprint for an AW has been saved in database 1140. So, until such voiceprint(s) have been stored, all speaker interactions with IVR 1105 need to involve other types of (full, not partial) speaker identification than VB (e.g. keying in or uttering a password).
[00103] Until a user voiceprint is created, the speaker’s voice samples are stored to database 1140 and a “voiceprint_creation_failure_reason” message (e.g. “voiceprint_verification_scoring_failure_reason=need_more_active_words”) is sent from VB 1130 to DM 1110. Upon successful creation of a speaker voiceprint for an AW, the AW voiceprint is stored in database 1140 alongside the speech AW samples (i.e. audio) that were used to create the AW voiceprint, and a “voiceprint_creation_success” message is sent by VB 1130 to DM 1110.
[00104] Once one or more voiceprints are created and associated with one or more AWs for the user, System 1100 may subsequently perform voice recognition and/or speaker verification by comparing voiceprints of AWs captured during a user call with stored user-specific voiceprints associated with AWs. In some implementations, the voiceprints may be created using a model that creates i-vectors (or x-vectors or other representations). As previously discussed, VB 1130 may compare the live user’s speech voiceprint with the stored AW speaker’s voiceprint(s) for the detected content, stored in database 1140 and/or 460, and produce a first similarity score for each voiceprint associated with the user. For example, VB 1130 may perform comparisons in a vector space, where both the detected live speaker’s or user’s speech voiceprint and the selected AW speaker’s voiceprint(s) are first converted into an i-vector (or x-vector, or other representation). Then, i-vectors (or other representations) may be stored in a compact form where correlations between the coefficients in the i-vector, or between the i-vector and the corresponding AW i-vector (in database 1150), are exploited, so as to significantly reduce the dimensions of the i-vectors that need to be compared, with the aim of reducing processing time during real-time operation.
[00105] VB 1130 uses the first similarity scores of the comparison results and selects a stored speaker voiceprint corresponding to the highest score, which it then compares against threshold T1; if the first similarity score equals or exceeds T1, the speaker is correctly verified.
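A compact sketch of one verification turn, combining pre-identification, Active Word detection, and scoring, is shown below; it reuses the candidate_voiceprints() and verify() helpers from the earlier sketches and assumes a hypothetical live_embedding() extractor, none of which are part of the disclosed system itself:

```python
# Illustrative end-to-end sketch of one verification turn. The helper functions
# candidate_voiceprints() and verify() are from the earlier sketches, and
# live_embedding() is a hypothetical voiceprint extractor.

def verification_turn(caller_id, transcript, audio, context_active_words,
                      stored_voiceprints, id_to_speaker):
    speaker_id = id_to_speaker.get(caller_id)          # pre-identification
    if speaker_id is None:
        return "unknown_caller"
    detected = set(transcript.lower().split()) & set(context_active_words)
    if not detected:
        return "need_more_active_words"                # prompt for more speech
    candidates = dict(candidate_voiceprints(stored_voiceprints, speaker_id, detected))
    if not candidates:
        return "voiceprint_does_not_exist"
    live_vp = live_embedding(audio)                    # hypothetical extractor
    accepted, _word, _score = verify(live_vp, candidates)
    return "verified" if accepted else "not_verified"
```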
[00106] As previously discussed, other speech and/or voice recognition models may be employed by VB 1130 to determine similarity scores based on comparing a voiceprint of a detected live AW with a stored voiceprint associated with an AW of a user.
[00107] In another implementation, a single uttered speech is used to identify the speaker and no second or subsequent uttered speech segments are needed.
[00108] It is noted that it would be obvious to a person of ordinary skill in the related art that the above steps and messages may be altered, deleted, or substituted with others, and that new messages and steps may be added, without altering the scope and/or context of the proposed innovative solution.
Voiceprint creation and storage
[00109] FIG. 4A shows a first example of a database storing speaker voiceprints and related information. The database 700 stores entries corresponding to speakers of the speaker identification service. By means of example, we consider a database storing data related to a banking application. A first speaker entry 710, corresponding to a first speaker, may comprise random speech (e.g. speaker name, debit account, and card number). These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker (“Maria Schneider”, “Debit Account”, “Card Number”, “1237890”, “6543”). The first speaker entry is also associated with a stored Voiceprint 1 715 and metadata describing the content and context of the voiceprint or other information associated with the first speaker. In existing systems, the content of the speech may be either random speech, requiring text-independent voice biometrics, or passphrase-specific (e.g., “my voice is my password”), requiring text-dependent voice biometrics. The present disclosure describes systems and methods that advantageously enable voice recognition and/or speaker recognition by applying voice biometrics to words that a user is likely to use, e.g., active words, depending on the context of the user’s interaction with a system or service. In this way, the processing power and memory of the system is substantially reduced with respect to text-independent voice biometric systems, while user acceptance is substantially increased with respect to text-dependent voice biometric systems because users are no longer required to recite the same phrase each time they access the system or service.
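Purely as an illustration of the kind of record shown in FIG. 4A, a speaker entry could be represented as below; the field names and types are assumptions made for this example:

```python
# Illustrative sketch: one way to represent a speaker entry of database 700,
# holding the uttered audio, the derived voiceprint, and content/context metadata.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SpeakerEntry:
    speaker_id: str
    audio_files: List[str]        # e.g. WAV/MP3 recordings of the utterances
    voiceprint: List[float]       # e.g. an i-vector/x-vector-like embedding
    metadata: Dict[str, str] = field(default_factory=dict)

entry_710 = SpeakerEntry(
    speaker_id="speaker_1",
    audio_files=["maria_schneider.wav", "debit_account.wav", "card_number.wav"],
    voiceprint=[0.11, -0.32, 0.58],
    metadata={"content": "name, debit account, card number", "context": "banking"},
)
```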
[00110] Database 700 also contains an nth speaker entry 780, corresponding to an nth speaker, which entry may comprise speaker name, account number, and address. These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the nth speaker (“George Smith”, “Account Number”, “Address”, “123454”). The nth speaker entry is also associated with a stored Voiceprint n 785 and metadata describing the content and context of the voiceprint or other information associated with the nth speaker.
[00111] Database 700 may also contain a second speaker entry 720 corresponding to the first speaker, which entry may comprise speaker name, credit card, and home city. These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker (“Maria Schneider”, “Credit Card”, “2378”, “New York”). The second speaker entry is also associated with a stored Voiceprint 2 725 and metadata describing the content and context of the voiceprint or other information associated with the first speaker. Additional entries (and associated voiceprints and metadata) may exist for any of the users, where each corresponds to different content (i.e. different uttered words and phrases) and context (i.e. the different meaning of the word pair “New York” as part of an address and as part of the name “New York Commercial Bank”).
[00112] In some implementations, a user and/or speaker may be asked to utter a specific passphrase (e.g. “my voice is my password...”). Other speakers are also asked to utter the same passphrase. Database 700 contains entries for all speakers, where these entries correspond to the same passphrase. Voiceprints are created and stored for each speaker uttering the same passphrase, together with metadata describing the content and context of the voiceprint or other information associated with the respective speaker.
Voiceprint creation and storage with AW
[00113] FIG. 4B shows an example of a database storing speaker Active Word voiceprints and related information. The database 700b stores entries corresponding to speakers of the speaker identification service. By means of example, we consider a database storing data related to a particular interaction context, e.g., a banking application. A first speaker entry 710b, corresponding to a first speaker, may comprise speaker name, debit account, and card number. These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker (“Maria Schneider”, “Debit Account”, “Card Number”, “1237890”, “6543”).
[00114] In one implementation, system 1100 uses historical data (e.g. word frequency used in previous speaker interactions with the system) and selects “Maria”, “Schneider”, “1237890”, and “6543” as AW1, AW2, AW3, and AW4, respectively. For each AW, system 1100 creates an associated Voiceprint 715b, 725b, 735b, 745b, where each AW Voiceprint is stored along with metadata describing the content and context of the voiceprint or other information associated with the first speaker.
[00115] Database 700 also contains speaker entries for other speakers, which entries may include speaker name, account number, and address, stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by those speakers. These entries are also associated with a stored Voiceprint and metadata describing the content and context of the voiceprint or other information associated with the corresponding speaker. In some implementations, the speaker may be asked to utter a specific passphrase (e.g. “my voice is my password...”). The same passphrase is uttered by all other speakers. All speaker utterances of the passphrase are stored in database 700 together with metadata and associated voiceprints.
[00116] FIG. 4C illustrates a database 400 associating users with user identifiers (IDs). The database includes a column 402 having a list of known users such as User A, User B, User C, and User X, among other users. The number of users may be limited only by the processing and memory capacity of, for example, System 1100. Column 404 includes user identifiers (e.g., IDA, IDB, IDC, and IDX, among others) corresponding to each user. A user ID may include one or more of the identifier information previously discussed herein. In some configurations, column 404 may include multiple IDs that correspond to a particular user.
[00117] FIG. 4D illustrates a database 440 or datastore associating active words with various interaction contexts. Column 442 includes a list of various interaction contexts. An interaction context includes a type of interaction by a user with a system such as System 1100. For example, Interaction Context A may represent a banking application where users interface with System 1100 to enroll, access their banking account information, or initiate banking transactions via interactive voice responses with System 1100. Interaction Context B may, for example, include an insurance application that enables users to submit claims via an IVR system. Interaction Context C, other Interaction Contexts, and Interaction Context X may include a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR-based application that enables users to interface with a service or product provider. Column 444 lists groups of Active Words that are associated with a corresponding Interaction Context. For example, Interaction Context A is associated with three Active Words AWA1, AWA2, and AWA3. While three Active Words are shown, any number of Active Words may be associated with an Interaction Context as previously discussed herein. As previously discussed herein, System 1100 may select AWs that are associated with a particular Interaction Context based on one or more selection criterion, which may include a frequency of use for a particular word or phrase by a population of users that interact with System 1100 via a particular Interaction Context. For example, System 1100 may gather data regarding all users during Interaction Context A, i.e., during calls to a banking application. As previously discussed, System 1100 may identify those words, subwords, or phrases most frequently spoken by all users of the user population that use the banking application. System 1100 may designate and/or identify one or more AWs (e.g., AWA1, AWA2, and AWA3) for Interaction Context A (e.g., a banking application). AWA1 may be the word “account,” AWA2 may be the word “card,” and AWA3 may be the word “one.” In a different interaction context such as Interaction Context C, e.g., an auto insurance application, AWC1 may be the word “car” and AWC2 may be the word “accident.” System 1100 may determine that only two AWs are needed to verify a user identity using Interaction Context C.
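For illustration only, the mappings of FIGS. 4C and 4D might be represented as simple lookup tables; the identifiers and example Active Words below follow the figures loosely and are otherwise assumptions:

```python
# Illustrative sketch: databases 400 and 440 rendered as simple mappings.
# Identifiers and example Active Words are assumptions made for this sketch.

USER_IDS = {                       # database 400: users -> user identifiers
    "User A": ["ID_A"],
    "User B": ["ID_B"],
    "User C": ["ID_C"],
}

CONTEXT_ACTIVE_WORDS = {           # database 440: interaction contexts -> AW groups
    "Interaction Context A": ["account", "card", "one"],    # e.g. banking
    "Interaction Context C": ["car", "accident"],           # e.g. auto insurance
}

def active_words_for(context):
    """Look up the group of Active Words associated with an interaction context."""
    return CONTEXT_ACTIVE_WORDS.get(context, [])

print(active_words_for("Interaction Context A"))   # ['account', 'card', 'one']
```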
[00118] FIG. 4E illustrates a database 460 that associates voiceprints of active words by various users for a particular interaction context associated with an IVR system. Column 462 includes a list of users, while columns 464 include voiceprints of recorded AWs of the corresponding users for Interaction Context A. For other Interaction Contexts, other databases like database 460 will include voiceprints of AWs of corresponding users. In some implementations, captured audio of AWs may be stored in addition to, or instead of, storing voiceprints. In instances where only captured audio of AWs is stored, VB 1130 may generate voiceprints of the stored audio of AWs in real time or near real time, while also generating voiceprints of detected live audio of AWs to perform speaker recognition. For example, User A has corresponding stored voiceprints VPUSERAAWA1, VPUSERAAWA2, and VPUSERAAWA3. These voiceprints may have been pre-generated based on a prior enrollment session and/or other IVR session between User A and System 1100. During a current session, ASR 1120 may detect AWA1 in User A’s speech using speech recognition of User A’s conversation. VB 1130 may then process the captured audio of AWA1 to generate a voiceprint of the captured audio of AWA1, and compare the voiceprint of the captured AWA1 with the stored voiceprint VPUSERAAWA1 to perform voice recognition and confirm the identity of User A. The same approach may be applied to detected audio of AWA2 and AWA3. In some implementations, column 462 may include identification information associated with each of User A, User B, User C, User X, and so on.
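As a hedged illustration of this per-user table, the sketch below stores either a precomputed AW voiceprint or only the captured AW audio, and generates the voiceprint on demand in the latter case; the embed() extractor and the table layout are assumptions:

```python
# Illustrative sketch in the spirit of database 460: each cell holds either a
# precomputed AW voiceprint or only captured AW audio, from which a voiceprint
# is generated on demand. embed() is a placeholder assumption.

def embed(audio):
    # Placeholder for an i-vector/x-vector extractor.
    return [sum(audio) / max(len(audio), 1)]

USER_AW_TABLE = {
    ("User A", "AW_A1"): {"voiceprint": [0.12]},             # precomputed
    ("User A", "AW_A2"): {"audio": [0.3, 0.1, -0.2, 0.4]},   # raw audio only
}

def stored_voiceprint(user, active_word):
    """Return the stored AW voiceprint, generating it from audio if necessary."""
    cell = USER_AW_TABLE.get((user, active_word))
    if cell is None:
        return None
    if "voiceprint" not in cell:             # real-time / near-real-time path
        cell["voiceprint"] = embed(cell["audio"])
    return cell["voiceprint"]

print(stored_voiceprint("User A", "AW_A2"))  # computed lazily from the audio
```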
[00119] FIG. 5 shows the basic hardware architecture of an application server.
Application Server 5100 comprises a microprocessor 5110, a memory 5120, a screen adapter 5130, a hard disk 5140, a graphics processor 5150, a communications interface adapter 5160, and a UI adapter 5170. Application Server 5100 may also contain other components which are not shown in FIG. 5, or may lack some of the components shown in FIG. 5.
[00120] FIG. 6 shows functional elements running on an application server 6200. The functional elements may be implemented as hardware, software, firmware, or a combination thereof. The functional elements may include an Operating System (OS) 6210, Utilities 6220, Application Server Software 6230, at least one Application or Web Service 6240, and at least one Hardware driver 6250. Additional hardware and/or software components may run at the application server, while some of those shown in FIG. 6 may optionally not be implemented.
[00121] In another exemplary implementation, the method described in FIG. 2 is modified by omitting the user pre-identification step 120. Furthermore, the Pre-identified User’s Live Voiceprint Calculation step 130 is modified to calculate the user’s live voiceprint (i.e. without knowledge of the user’s identity), and the Fetch Pre-identified User’s Stored Voiceprints step 140 is modified to fetch all stored voiceprints (i.e. belonging to all users), or all stored voiceprints of the same category or of similar characteristics.
[00122] Additional exemplary aspects include a method for performing speaker verification including performing a speech conversation with a first user and receiving a first user identifier. The speech conversation may have an interaction context based on a subject matter of the speech conversation. The method also includes receiving a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users. The method includes receiving the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words. The method further includes performing speech recognition of the first user audio provided during the speech conversation, where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements. The method also includes receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements. The method also includes: comparing the plurality of corresponding text elements with each active word of the group of active words, identifying text elements matching each active word of the group of active words, generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word, comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and, if the similarity score is greater than or equal to a threshold value, indicating that the first user identifier is verified or, if the similarity score is less than the threshold value, indicating that the first user identifier is not verified.
[00123] A further exemplary aspect includes a system for performing speaker verification including: means for performing a speech conversation with a first user and means for receiving a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation; means for storing a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users; means for storing the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words; means for performing speech recognition of the first user audio provided during the speech conversation, where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements; means for receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements; means for comparing the plurality of corresponding text elements with each active word of the group of active words; means for identifying text elements matching each active word of the group of active words; means for generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word; means for comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words; means for generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and, if the similarity score is greater than or equal to a threshold value, means for indicating that the first user identifier is verified or, if the similarity score is less than the threshold value, means for indicating that the first user identifier is not verified.
[00124] Another exemplary aspect includes a method for verifying a speaker using natural speech including: initiating a session with one of a system or a computing device or a computer apparatus or a telephone; capturing session related data; pre-identifying the user using the session related data; capturing the user’s live speech; identifying at least one frequent word in the user’s speech; calculating the user’s live voiceprint; retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data; comparing the live voiceprint with the at least one retrieved voiceprint; and verifying the user.
[00125] A further exemplary aspect includes a system for verifying a speaker using natural speech, the system including: means for initiating a session with one of a system or a computing device or a computer apparatus or a telephone; means for capturing session related data; means for pre-identifying the user using the session related data; means for capturing the user’s live speech; means for identifying at least one frequent word in the user’s speech; means for calculating the user’s live voiceprint; means for retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data; means for comparing the live voiceprint with the at least one retrieved voiceprint; and means for verifying the user.
[00126] Yet another exemplary aspect includes a non-transitory computer program product that causes a system to verify a speaker using natural speech, the non-transitory computer program product having instructions to: initiate a session with one of a system or a computing device or a computer apparatus or a telephone; capture session related data; pre-identify the user using the session related data; capture the user’s live speech; identify at least one frequent word in the user’s speech; calculate the user’s live voiceprint; retrieve at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data; compare the live voiceprint with the at least one retrieved voiceprint; and verify the user.
[00127] The above exemplary implementations are intended for use either as a standalone user voice biometric identification method in any conceivable scientific and business domain, or as part of other scientific and business methods, processes and systems.
[00128] The above exemplary descriptions are simplified and may not include certain hardware and software elements that may be used in the implementations because they may not be part of an inventive concept, are not needed for the understanding of the
implementations, and are within the common general knowledge of a person of ordinary skill in the art. Furthermore, variations of the described method, system architecture, and software architecture are possible, where, for instance, method steps, and hardware and software elements, may be rearranged, omitted, or newly added.
[00129] Various aspects of the invention are described above in the Detailed
Description. While these descriptions directly describe the above aspects, it is understood that those skilled in the art may conceive modifications and/or variations to the specific aspects shown and described herein. Any such modifications or variations that fall within the purview of this description are intended to be included therein as well. Unless specifically noted, it is the intention of the inventor that the words and phrases in the specification and claims be given the ordinary and accustomed meanings to those of ordinary skill in the applicable art(s). [00130] The foregoing description of various exemplary aspects of the application has been presented and is intended for the purposes of illustration and description. It is not intended to be exhaustive or limit the invention to the precise form disclosed and many modifications and variations are possible in the light of the above teachings. The exemplary aspects were chosen and described to explain the principles of the invention and its practical application and to enable others skilled in the art to best utilize the inventions in various aspects and with various modifications as are suited to the particular use contemplated. Therefore, it is intended that the inventions not be limited to the particular implementations disclosed herein for carrying out the inventions, but that the inventions will include all aspects falling within the scope of the appended claims.
[00131] In one or more exemplary aspects, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored in a non-transitory manner, or transmitted over, as one or more instructions or code on a computer readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer or any other device or apparatus operating as a computer. Such storage mediums and/or databases on them may be referred to as datastores. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
[00132] The above description of the disclosed exemplary aspects is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these exemplary aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
[00133] What is claimed is:

Claims

1. A system for performing speaker verification comprising: an interactive voice recognition (IVR) module arranged to perform a speech conversation with a first user and receive a first user identifier, the speech conversation having an interaction context based on a subject matter of the speech conversation;
a datastore arranged to:
store a group of active words associated with the interaction context, each active word being selected based on one or more selection criterion derived from conversations of a population of users;
store the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, each first user voiceprint corresponding to each active word of the group of active words;
an automated speech recognition (ASR) module arranged to perform speech recognition of the first user audio provided during the speech conversation, the ASR module converting the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements;
a voice biometric (VB) module arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, and v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
2. The system of claim 1, wherein the similarity score is based on the closest comparison of one of the captured voiceprints with one of the first user voiceprints.
3. The system of any of claims 1 and 2, wherein an audio element includes at least one selected from the group of a phoneme, syllable, word, subword, and phrase.
4. The system of any of claims 1-3, wherein a text element includes at least one selected from the group of a word, subword, and phrase.
5. The system of any of claims 1-4, wherein portions of operations of the IVR, ASR, and VB are implemented in a common module.
6. The system of any of claims 1-5, wherein the speech recognition includes at least one speech recognition model selected from the group of vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and end to end automatic speech recognition.
7. The system of any of claims 1-6, wherein the one or more of the comparisons of each captured voiceprint with each first user voiceprint includes at least one voice recognition model selected from the group of vector analyses, Hidden Markov models (HMMs),
Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and end to end automatic speech recognition.
8. The system of any of claims 1-7, wherein the one or more selection criterion includes at least one selected from the group of frequency of use, type of word, amount of phonemes in a word, combination of phonemes, and amount of syllables in a word or phrase.
9. The system of any of claims 1-8, wherein the interaction context includes a type of interaction that the first user has with the IVR.
10. The system of claim 9, wherein the interaction context includes any one of a banking application, a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR-based application that enables users to interface with a service or product provider.
11. A server configured to perform speaker verification comprising: a communications interface arranged to perform a speech conversation with a first user and receive a first user identifier, the speech conversation having an interaction context based on a subject matter of the speech conversation;
the communications interface further arranged to:
receive, from a datastore, a group of active words associated with the interaction context, each active word being selected based on one or more selection criterion derived from conversations of a population of users;
receive, from the datastore, the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, each first user voiceprint corresponding to each active word of the group of active words; and
a processor arranged to perform speech recognition of the first user audio provided during the speech conversation, the processor converting the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements;
the processor further arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, and v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
12. The server of claim 11, wherein the similarity score is based on the closest comparison of one of the captured voiceprints with one of the first user voiceprints.
13. The server of any of claims 11 and 12, wherein an audio element includes at least one selected from the group of a phoneme, syllable, word, subword, and phrase.
14. The server of any of claims 11-13, wherein a text element includes at least one selected from the group of a word, subword, and phrase.
15. The server of any of claims 11-14, wherein portions of datastore are implemented in the server.
16. The server of any of claims 11-15, wherein the speech recognition includes at least one speech recognition model selected from the group of vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and end to end automatic speech recognition.
17. The server of any of claims 11-16, wherein the one or more of the comparisons of each captured voiceprint with each first user voiceprint includes at least one voice recognition model selected from the group of vector analyses, Hidden Markov models (HMMs),
Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and end to end automatic speech recognition.
18. The server of any of claims 11-17, wherein the one or more selection criterion includes at least one selected from the group of frequency of use, type of word, amount of phonemes in a word, combination of phonemes, and amount of syllables in a word or phrase.
19. The server of any of claims 11-18, wherein the interaction context includes a type of interaction that the first user has with the IVR.
20. The server of claim 19, wherein the interaction context includes any one of a banking application, a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR-based application that enables users to interface with a service or product provider.
21. A method for performing speaker verification comprising: performing a speech conversation with a first user and receiving a first user identifier, the speech conversation having an interaction context based on a subject matter of the speech conversation;
receiving a group of active words associated with the interaction context, each active word being selected based on one or more selection criterion derived from conversations of a population of users,
receiving the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, each first user voiceprint corresponding to each active word of the group of active words; and
performing speech recognition of the first user audio provided during the speech conversation, the processor converting the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements; receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements;
comparing the plurality of corresponding text elements with each active word of the group of active words;
identifying text elements matching each active word of the group of active words; generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word;
comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words;
generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and
if the similarity score is greater than or equal to a threshold value, indicating that the first user identifier is verified or if the similarity score is less than the threshold value, indicating that the first user identifier is not verified.
22. A system for performing speaker verification comprising: means for performing a speech conversation with a first user and receiving a first user identifier, the speech conversation having an interaction context based on a subject matter of the speech conversation;
means for storing a group of active words associated with the interaction context, each active word being selected based on one or more selection criterion derived from
conversations of a population of users,
means for storing the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, each first user voiceprint corresponding to each active word of the group of active words; and
means for performing speech recognition of the first user audio provided during the speech conversation, the processor converting the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements;
means for receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements; means for comparing the plurality of corresponding text elements with each active word of the group of active words;
means for identifying text elements matching each active word of the group of active words;
means for generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word;
means for comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words;
means for generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and
if the similarity score is greater than or equal to a threshold value, means for indicating that the first user identifier is verified or, if the similarity score is less than the threshold value, means for indicating that the first user identifier is not verified.
23. A method for verifying a speaker using natural speech, comprising:
initiating a session with one of a system or a computing device or a computer apparatus or a telephone;
capturing session related data;
pre-identifying the user using the session related data;
capturing the user’s live speech;
identifying at least one frequent word in the user’s speech;
calculating the user’s live voiceprint;
retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data;
comparing the live voiceprint with the at least one retrieved voiceprint; and verifying the user.
24. A system for verifying a speaker using natural speech, the system comprising:
means for initiating a session with one of a system or a computing device or a computer apparatus or a telephone;
means for capturing session related data;
means for pre-identifying the user using the session related data;
means for capturing the user’s live speech;
means for identifying at least one frequent word in the user’s speech; means for calculating the user’s live voiceprint;
means for retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data;
means for comparing the live voiceprint with the at least one retrieved voiceprint; and
means for verifying the user.
25. A non-transitory computer program product that causes a system to verify a speaker using natural speech, the non-transitory computer program product having instructions to:
initiate a session with one of a system or a computing device or a computer apparatus or a telephone;
capture session related data; pre-identify the user using the session related data;
capture the user’s live speech;
identify at least one frequent word in the user’s speech;
calculate the user’s live voiceprint;
retrieve at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user’s pre-identification data;
compare the live voiceprint with the at least one retrieved voiceprint; and verify the user.
PCT/EP2019/057475 2019-03-25 2019-03-25 Systems and methods for speaker verification WO2020192890A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/057475 WO2020192890A1 (en) 2019-03-25 2019-03-25 Systems and methods for speaker verification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/057475 WO2020192890A1 (en) 2019-03-25 2019-03-25 Systems and methods for speaker verification

Publications (1)

Publication Number Publication Date
WO2020192890A1 true WO2020192890A1 (en) 2020-10-01

Family

ID=65911190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/057475 WO2020192890A1 (en) 2019-03-25 2019-03-25 Systems and methods for speaker verification

Country Status (1)

Country Link
WO (1) WO2020192890A1 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5414755A (en) * 1994-08-10 1995-05-09 Itt Corporation System and method for passive voice verification in a telephone network
US20130132091A1 (en) * 2001-01-31 2013-05-23 Ibiometrics, Inc. Dynamic Pass Phrase Security System (DPSS)
EP1507394A2 (en) * 2003-08-11 2005-02-16 Microsoft Corporation Speech recognition enhanced caller identification
US20160372116A1 (en) * 2012-01-24 2016-12-22 Auraya Pty Ltd Voice authentication and speech recognition system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"X-vectors: Robust DNN embeddingsfor speaker recognition", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP, April 2018 (2018-04-01), pages 5329 - 5333
KENNY ET AL.: "Front-end factor analysis for speaker verification", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 19, no. 4, 2011, pages 788 - 798, XP011352006, DOI: doi:10.1109/TASL.2010.2064307

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022169931A1 (en) * 2021-02-03 2022-08-11 Komal Dangi System and method of authenticating candidates for job positions
EP4047496A1 (en) * 2021-02-17 2022-08-24 ValidSoft Limited Processing input data
CN113033191A (en) * 2021-03-30 2021-06-25 上海思必驰信息科技有限公司 Voice data processing method, electronic device and computer readable storage medium
CN112992153A (en) * 2021-04-27 2021-06-18 太平金融科技服务(上海)有限公司 Audio processing method, voiceprint recognition device and computer equipment
CN112992153B (en) * 2021-04-27 2021-08-17 太平金融科技服务(上海)有限公司 Audio processing method, voiceprint recognition device and computer equipment
WO2022233239A1 (en) * 2021-05-07 2022-11-10 华为技术有限公司 Upgrading method and apparatus, and electronic device
GB2612032A (en) * 2021-10-19 2023-04-26 Validsoft Ltd An authentication system and method
EP4170526A1 (en) * 2021-10-19 2023-04-26 ValidSoft Limited An authentication system and method
WO2023070458A1 (en) * 2021-10-28 2023-05-04 Boe Technology Group Co., Ltd. Method for outputting voice transcript, voice transcript generating system, and computer-program product
CN114299957A (en) * 2021-11-29 2022-04-08 北京百度网讯科技有限公司 Voiceprint separation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20200312337A1 (en) Systems and methods for speaker verification
WO2020192890A1 (en) Systems and methods for speaker verification
US11900948B1 (en) Automatic speaker identification using speech recognition features
CN106373575B (en) User voiceprint model construction method, device and system
JP6740504B1 (en) Utterance classifier
US10810997B2 (en) Automated recognition system for natural language understanding
AU2016216737B2 (en) Voice Authentication and Speech Recognition System
US20160372116A1 (en) Voice authentication and speech recognition system and method
US7711105B2 (en) Methods and apparatus for processing foreign accent/language communications
US7222072B2 (en) Bio-phonetic multi-phrase speaker identity verification
US8484031B1 (en) Automated speech recognition proxy system for natural language understanding
JP5042194B2 (en) Apparatus and method for updating speaker template
KR102097710B1 (en) Apparatus and method for separating of dialogue
US10972609B2 (en) Caller deflection and response system and method
WO2016194740A1 (en) Speech recognition device, speech recognition system, terminal used in said speech recognition system, and method for generating speaker identification model
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN114385800A (en) Voice conversation method and device
KR20230098266A (en) Filtering the voice of other speakers from calls and audio messages
US20080316888A1 (en) Device Method and System for Communication Session Storage
CN112331217A (en) Voiceprint recognition method and device, storage medium and electronic equipment
US10147417B2 (en) Electronic speech recognition name directory prognostication system by comparing a spoken name's packetized voice to stored phonemes
CN112565242B (en) Remote authorization method, system, equipment and storage medium based on voiceprint recognition
CN113593580A (en) Voiceprint recognition method and device
CN113744742A (en) Role identification method, device and system in conversation scene
US11558506B1 (en) Analysis and matching of voice signals

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19713461

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19713461

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the EP Bulletin as the address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.07.2022)
