NL2012300C2

NL2012300C2 - Automated audio optical system for identity authentication.

Info

Publication number: NL2012300C2
Application number: NL2012300A
Authority: NL
Inventors: Joost Johannes Hendrikus Christiaan Doremalen; Martijn Enter
Original assignee: Novolanguage B V
Priority date: 2014-02-21
Filing date: 2014-02-21
Publication date: 2015-08-25

Description

Automated audio optical system for identity authentication

DESCRIPTION

FIELD OF THE INVENTION

The present invention is in the field of automated systems and methods for audio optical identity authentication.

BACKGROUND OF THE INVENTION

Authentication relates to confirming the authenticity of an attribute and specifically of a person. Such may involve confirming an identity of a person. Authentication often involves verifying the validity of at least one form of identification.

One type of authentication is comparing characteristics of an object (e.g. person) to what is known about objects of that origin. For example, an art expert might look for similarities in the style of painting, check the location and form of a signature, or compare the object to an old photograph. Physics of sound and light, and comparison with a known physical environment, such as a previously recorded environment, can be used to examine the authenticity of audio recordings, photographs, or videos.

Authentication of a person can be split into categories, namely: something a person knows, something a person has, and something a person is. Each authentication category covers a range of elements used to authenticate or verify a person's identity. The present invention lies in establishing a relationship of physical and personal characteristics of a person within the three mentioned categories in a first situation and in a second situation, and comparing the characteristics of the two situations, in order to establish if the person in the first situation is the same as in the second situation.

It is noted that for the second mentioned category, which typically relates to attributes of a person, comparison may be vulnerable to forgery. In general, it relies on the facts that creating a forgery indistinguishable from a genuine artifact requires (expert) knowledge, that mistakes are easily made, and that the amount of effort required to do so is considerably greater than the amount of profit that can be gained from the forgery. In order to prevent such forgery or abuse a watertight system needs to be developed.

It is considered that for a "positive" authentication, elements from at least two, and preferably all three, categories should be verified. Such is the aim of the present invention. Some examples of each category are: ownership (something a user has): ID card, security token, wrist band, software token, and cell phone; knowledge (something a user knows): personal identification number (PIN), a password, pass phrase, challenge response, a pattern; inherence (something a user is or does): DNA sequence, retinal pattern, fingerprint, signature, face, voice, unique bio-electric signals, other biometric identifier.

The present system is aimed at using all three categories (sometimes referred to as factors) for authentication, and to comparing authentication in a first situation to authentication in a second situation.

The present invention therefore relates to automated systems and methods for audio optical identity authentication, which overcome one or more of the above disadvantages, without jeopardizing functionality and advantages.

SUMMARY OF THE INVENTION

The present invention relates in a first aspect to an automated system according to claim 1 comprising various electronic elements. The automated system is typically at least partially implemented on a computer, an electronic equipment, or the like. The present system is suited for real-time authentication.

In a first means audio input is processed. Part of initial processing may relate to improving an audio signal, such as by removing noise, echo, flanger, unusual and atypical sound, phaser, etc. Further a signal may (partly) be attenuated or boosted to produce desired spectral characteristics (equalized). Also typically filtering of a signal is used, in order to emphasize or suppress frequency ranges, such as by use of low-pass, high-pass, band-pass or band-stop filters. Compression may be used to reduce a dynamic range of a sound. A reversed pitch shift may be used to identify lower or upper harmonics and/or resonators. Time stretching, time shortening, and modulation may be used as a further characterization of the signal. Care is taken to process audio input in such a way that in any given situation more or less the same processed audio signal is provided.
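By way of illustration only, three of the processing steps named above (low-pass filtering, dynamic range compression, and bringing signals to a comparable level) may be sketched as follows. The function names, the one-pole filter, and the peak normalization are assumptions chosen for brevity; the description does not prescribe any particular filter design or normalization.

```python
import math


def low_pass(samples, alpha):
    """One-pole low-pass filter (exponential smoothing), 0 < alpha <= 1.

    Hypothetical helper: a smaller alpha suppresses high frequencies more.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out, prev = [], 0.0
    for x in samples:
        prev += alpha * (x - prev)
        out.append(prev)
    return out


def compress(samples, threshold, ratio):
    """Reduce dynamic range: the part of a magnitude above `threshold`
    is divided by `ratio`; the sign of each sample is preserved."""
    out = []
    for x in samples:
        mag = abs(x)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        out.append(math.copysign(mag, x))
    return out


def peak_normalize(samples, peak=1.0):
    """Scale the signal so that its largest magnitude equals `peak`.

    Assumed here as one simple way to obtain comparable signals from
    recordings made at different levels; silence is returned unchanged.
    """
    largest = max((abs(x) for x in samples), default=0.0)
    if largest == 0.0:
        return list(samples)
    return [x * peak / largest for x in samples]
```

Applying the same chain with the same parameters to input captured in a first and in a second situation is one possible way to approach the aim that more or less the same processed audio signal is provided.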

Typically at least two means for receiving audio inputs are provided, as input may be received in different locations. However, if input is received at a same location in a consecutive mode, e.g. at different times, one means of course will be sufficient. A similar reasoning also applies for the means for receiving optical input.

In a second means optical input is processed. Similar processing techniques as above for the audio signals may be used for the optical signals. Care is taken to process optical input in such a way that in any given situation more or less the same processed optical signal is provided.

It is preferred to process and capture audio and optical input in combination, if possible. In a first situation the combined input may be captured by a computer of a user at home, in a second situation the combined input may be captured by a video camera observing a user in a public environment or in a confined environment, such as an assessment.

The processing of the input is typically performed by a processor, such as a chip, a computer, or a dedicated apparatus. The processor typically has sophisticated circuits for processing, and typically some software for controlling circuits. The processor provides output that may be further processed, such as by the present authentication comparator. A user of the system, who may or may not be present when receiving input, may use the output generated by the system, e.g. in order to authenticate a person, and to compare authentication in a first situation to authentication in a second situation. As such, with a certain uncertainty, it can be established if an authenticated person in the first situation is the same as an authenticated person in the second situation. The present system is not directly aimed at authentication of persons at substantially the same time in a given location, but rather at authentication of a large number of persons over time and in given locations, such as 1,000 persons, typically 10,000 persons or more.

In order to process the input an audio signal processing unit, an optical signal processing unit and an identity code capturer are provided. The two optical processing units may be one and the same. These units may be stored on a computer, may be present as such, may be in the form of software, and combinations thereof.

Part of the audio processing unit relates to automatic speech recognition (ASR). In general it is noted that ASR is already quite challenging for native speech, but it is even more challenging for non-native speech, since non-native speech deviates substantially from native speech in at least three aspects: the sounds, lexicon, and grammar differ. The present ASR technology differs in many ways from others. As a consequence e.g. a word error rate (incorrectly determined words) of the present system is 5%-20%, as has been established upon evaluating the system with a significant number of users.
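For reference, a word error rate can be computed as sketched below. The sketch assumes the commonly used definition (substitutions, deletions and insertions, divided by the number of words in the reference text); the description itself does not state which definition was used for the 5%-20% figure, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the reference (target) text and
    the recognized text, divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

For instance, recognizing "the cat sit on mat" for the target "the cat sat on the mat" involves one substitution and one deletion over six reference words.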

As explained above one or more audio analyzers, and likewise processors, optionally combined in one unit, are present.

In order to verify if audio input is "correct", e.g. as expected, at least one error detector is present. Such a detector may signal errors which preferably are not taken into account when authenticating.

The processed audio (and likewise optical) input relating to information of a person is stored on a data storage means, such as a memory. Data may be stored as such, as a representation of data, such as in an n-dimensional vector space, as characterizations of a person, and a combination thereof.

Similar to the audio analyzers, an optical processing unit is provided. Also an identity code capturer is provided, having a same functionality as the optical processing unit; it is noted that the identity code capturer is capable of processing an image, such as a bar code, a matrix code, a social security number, an identity number, a passport photo, and combinations thereof. Optical data may be stored separately from the audio data, or in the same data storage means.

Further an authentication comparator for mutually comparing and authenticating input of a first means for audio input and a first means for optical input, with input of a second means for audio input, and a second means for optical input, and with input of an identity code capturer, respectively, is provided. In an example the first audio and optical means are aimed at establishing ownership and inherence characteristics of a person, typically in a first situation, whereas the second audio and optical means are aimed at establishing knowledge, inherence and optionally ownership characteristics of a person, typically in a second situation. In the second situation knowledge characteristics are added to the system, whereas inherence and optionally ownership characteristics are used for authentication.

Thereto an authentication comparator is provided, for comparing e.g. input of a first situation with input of a second situation, the inputs being processed. An inherence characteristic that may be compared and scored is pronunciation.

The output is preferably provided in a visual manner, such as on a monitor.

The present system is provided with a means for receiving audio input. The input is typically provided by the user, the user reading out loud a (target) text, the text being provided by the present system, giving an answer to a question posed, etc., such as in the form of spoken language. The target text and the like may be provided by a virtual agent. The present system may provide prompts. As such a user may select to repeat an exercise, hear back his/her own input, be provided with an example input, continue, etc. The example input may also be provided as a randomly provided sequence of words, which require a user to return a correct syntax. A typical length of the present input is 10-250 phonemes, such as 50-100 phonemes.

The present (first and second phase) automated speech recognition software (ASR) may consist of a decoder (a search algorithm) and three 'knowledge sources': a language model, a lexicon, and at least one acoustic model. The language model (LM) contains probabilities of words and sequences of words. Acoustic models are models of how the sounds of a language are pronounced. The lexicon contains information on how the words are pronounced.

The present system may further comprise a first means for determining and analyzing input. In view of e.g. authenticating it may be important to determine what word(s) were actually spoken. The first means may relate to a first phase (automated) speech recognition software, which software typically determines input in a tolerant mode, e.g. globally checking given (or actual) input versus required (target) input (the provided target text). A goal thereof is to recognize words a user intended to pronounce, even though the non-native speech of a user may deviate in various ways. The ASR system is optimized for this phase, e.g. by tuning the three knowledge sources using non-native speech.

The output of the first phase speech recognition software may provide input to the second phase speech recognition software (or in an alternative, vice versa).

The system may further comprise a second means for determining input, such as second phase (automated) speech recognition software comprising a pronunciation quality evaluation unit for processing input to determine potential difference between target pronunciation and actual pronunciation, which unit functions in a detailed and strict manner. The manner may depend on the level of the user. The output of the first phase may be used as input, as well as the non-processed captured input. The differences identified, if any, may further be used to authenticate a person, as these differences characterize such a person further.

In the second phase the system is strict. Now a goal is to detect differences, such as large deviations between pronunciation received and target pronunciation. A further version of the ASR system is used which is optimized for this task. The ASR system then segments the non-native speech signal, it detects the position (begin and endpoint) of the words and the phonemes (sounds).
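The strict comparison of target pronunciation and received pronunciation may be illustrated, under simplifying assumptions, by aligning two phoneme sequences and listing where they deviate. The sketch below uses the Python standard library class difflib.SequenceMatcher as a stand-in for the alignment; the present ASR-based segmentation is not reproduced, and the phoneme symbols and function name are hypothetical.

```python
from difflib import SequenceMatcher


def pronunciation_differences(target, actual):
    """Return (kind, target_phonemes, actual_phonemes) for every stretch
    where the received phoneme sequence deviates from the target.

    `kind` is one of "replace", "delete" or "insert", as reported by
    difflib; equal stretches are skipped.
    """
    matcher = SequenceMatcher(a=target, b=actual, autojunk=False)
    differences = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            differences.append((kind, target[i1:i2], actual[j1:j2]))
    return differences
```

A speaker who realizes the target sequence ["th", "r", "iy"] as ["t", "r", "iy"] would yield a single "replace" entry; such recurring deviations are the kind of differences that may further characterize a person.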

The system may further comprise various error detectors. These detectors relate to one or more of sounds and phonemes, lexicon, grammar, and prosody. Examples are a pronunciation error detector, a prosody error detector, e.g. a word stress error detector and an intonation error detector, a respiration error detector, a formant error detector, and a grammar error detector, e.g. a morphology error detector and a syntax error detector, an interaction error detector, and a lexicon error detector. Typically these detectors are optimized, e.g. in view of first and second language, such as Dutch. The errors, or differences, identified, if any, may further be used to authenticate a person, as these differences characterize such a person further.

The system may further comprise a selector for selecting a first phase speech recognition software version and/or a second phase speech recognition software version, the version(s) being optimized for a group of users. As such a user or a teacher may set a software version being specifically adapted to a level of oral language proficiency of a user, adapted to a native language of a user, adapted to a variety or dialect of a user, and combinations thereof.

Such further optimization can be used for authentication, especially if (characterizing) differences between persons are otherwise relatively small.

The present software and detectors are stored. They may be stored in any means capable of storage of binary data, such as RAM, a ROM, a hard-disk, a CD, a DVD, etc., and combinations thereof. The stored data should be accessible to the present system, when in use. It is noted that various elements of the present system may be located within one location, even within one apparatus, such as a computer, wherein e.g. software is loaded on memory, or located at different locations, such as on the internet, on a mobile phone, on a computer, at a learning center, and combinations thereof. Within e.g. a combination a first element may function as a client to a further element, an element may function as a server, etc. For some applications it is preferred to use a server or a cloud. A user may interact with a server or cloud as a client, e.g. a browser based client. Preferably a broadband connection between client and server or cloud is used, enabling fast communication of data.

The present system may be accessible on internet, on a hard disk of a computer, on a DVD, a CD-ROM, etc.

Thereby the present invention provides a solution to one or more of the above mentioned problems, by providing an extended system, comprising various functionalities, wherein the functionalities are further optimized with respect to each other, thereby further improving functionality and user friendliness.

Advantages of the present description are detailed throughout the description.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates in a first aspect to an automated audio-optical system for user identity authentication according to claim 1.

In an example the present invention relates to a system wherein personal characteristics of a user are stored in a first separate domain, and/or wherein the authentication comparator is stored in a second separate domain, and/or wherein audio and optical information is stored in a third separate domain, wherein the first, second and third domain can be linked by a secret key, wherein the secret key is stored in a fourth domain, wherein the fourth domain is preferably only accessible to an administrator upon entering a user ID and a code. As such privacy of personal data is safeguarded.
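How the secret key links the domains is not specified. One conceivable realization, given purely as an assumption, is to store the records of each domain under a keyed pseudonym (here an HMAC-SHA256 of the domain name and a user identifier), so that records in different domains cannot be related to each other without the key. All names, identifiers and record contents below are hypothetical.

```python
import hashlib
import hmac


def pseudonym(secret_key, user_id, domain):
    """Keyed, domain-specific pseudonym for a user (assumed scheme)."""
    message = f"{domain}:{user_id}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def store(domain_store, secret_key, user_id, domain, record):
    """Store a record in one domain under the user's pseudonym."""
    domain_store[pseudonym(secret_key, user_id, domain)] = record


def link(stores, secret_key, user_id):
    """Collect the records of one user across all domains.

    Only a holder of the secret key (the fourth domain) can do this.
    """
    return {domain: records.get(pseudonym(secret_key, user_id, domain))
            for domain, records in stores.items()}
```

In such a scheme an administrator who has obtained the key from the fourth domain can relate personal characteristics to the stored audio and optical information, whereas each domain on its own holds only pseudonymous records.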

In an example the present invention relates to a system further comprising stored on the system y1) first phase speech recognition software for determining audio input in a tolerant mode, wherein the input is in the form of a word, a sentence, and the like, wherein a typical length of the present input is preferably 10-250 phonemes, such as 50-100 phonemes, the first phase speech recognition software preferably providing input to second phase speech recognition software, and y2) second phase speech recognition software for determining audio input in a strict mode, comprising a pronunciation quality evaluation unit for processing input to determine potential difference between stored target pronunciation and actual audio input pronunciation, and for generating feedback output. As such especially audio input and processing thereof is improved, e.g. in a reduction of error rates of the present system.

In an example the present invention relates to a system further comprising one or more of e) v) a word stress error detector, vi) a morphology error detector, vii) a syntax error detector, viii) an interaction error detector, ix) an intonation error detector, x) a respiration error detector, xi) a formant error detector, and xii) a selector for selecting a first phase speech recognition software version and/or a second phase speech recognition software version, the version(s) being optimized for a group of users. The various error detectors and selector further contribute to the robustness of the present system.

In an example the present invention relates to a system wherein input and/or output are in a second language and the user being native in a first language, wherein the first and second language are selected from Indo-European languages, such as Spanish, English, Hindi, Portuguese, Bengali, Russian, German, Marathi, French, Italian, Punjabi, Urdu, Dutch, German, French, Spanish, Italian, Sino-Tibetan languages, such as Chinese, Austro-Asiatic languages, Austronesian languages, Altaic languages, such as wherein the first and second language are Dutch and English, Dutch and German, Dutch and Spanish, Dutch and Chinese, German and English, French and English, Chinese and English, preferably wherein the second language is a foreign language such as English, and vice versa.

In an example the first language may be selected from Dutch, German, French, Spanish, Italian, Polish, Chinese, Japanese, Korean, Afrikaans, and English.

In an example the second language may be selected from Dutch, German, French, Spanish, Italian, Polish, Chinese, Japanese, Korean, Afrikaans, and English.

In an example the present invention relates to a system wherein the authentication comparator further comprises an audio subtractor for subtracting a (part of a) first audio input from a (part of a) second audio input, an optical subtractor for subtracting a (part of a) first optical input from a (part of a) second optical input, and for subtracting a (part of a) first optical input from a (part of a) identity code input. By subtracting an optional difference can be detected. Based on a difference found it can be determined if the difference is significant/measurable, or insignificant/not measurable. A significant difference indicates authentication has failed, whereas an insignificant difference indicates authentication has been successful.
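The subtraction and the subsequent decision may be sketched as follows, assuming that each input has already been reduced to a fixed-length feature vector. The representation, the Euclidean norm as a measure of the difference, and the threshold value are assumptions made for illustration; they are not taken from the description.

```python
import math


def subtract(first, second):
    """Element-wise difference: first input subtracted from second input."""
    if len(first) != len(second):
        raise ValueError("inputs must have the same length")
    return [b - a for a, b in zip(first, second)]


def is_significant(difference, threshold):
    """A difference is treated as significant when its Euclidean norm
    exceeds the (assumed, application-specific) threshold."""
    return math.sqrt(sum(d * d for d in difference)) > threshold


def authenticate(first, second, threshold):
    """Authentication succeeds when the difference is insignificant."""
    return not is_significant(subtract(first, second), threshold)
```

The same pattern may be applied to audio input, to optical input, and to the comparison of optical input with identity code input, each with its own features and threshold.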

In an example the present invention relates to a system wherein the authentication comparator further comprises one or more of a pronunciation subtractor, a language proficiency subtractor, a communication ability subtractor, a correctness of answer subtractor, a user capability subtractor, a job experience subtractor, an education subtractor, a grade subtractor, etc. The various subtractors may be used to identify further differences or absences thereof. Therewith robustness of the comparator is further improved.

Also varieties of the above languages may be selected, such as British English, American English, Australian English, Canadian English, New Zealand English, Indian English, etc.

Further, the present system is also adapted to process dialects, such as Dutch dialects and varieties, such as wherein the pronunciation quality evaluation unit is adapted for one or more (language) varieties and/or dialects, such as British English, Limburgs, Brabants, Gronings, and Drenths. Clearly such can only be achieved after gathering data, analyzing data, ordering data, etc. as described throughout the description. The processing of dialects and/or varieties may further be used to authenticate a person, as the use of a specific dialect and/or variety characterizes such a person further.

In an example the present invention relates to a system wherein the pronunciation quality evaluation unit comprises software, wherein the software is preferably being stored on a computer.

In an example the present invention relates to a system further comprising one or more of a language model, a lexicon, a phoneme model, one or more thresholds, one or more probability criteria, one or more random number generators, a level adjustment set-up, and a decoder, wherein the decoder may comprise the previous elements.

In an example the present invention relates to a system further comprising one or more of a reference set of parameters, a fine-tuning mechanism, a self-learning algorithm, a self-improvement algorithm, and a selection means for selecting criteria. The parameters may for instance relate to one or more classifiers, as well as to (implementing) algorithms, e.g. for determining a probability.

In an example the present invention relates to a system further comprising a data base, wherein data is stored for one or more of pronunciation, word stress, intonation, and phoneme segmentation. It is noted that the present data base comprises an extensive amount of data, gathered throughout the years.

In an example the present system further comprises one or more decision trees, stored on the system, such as a decision tree being adapted to provide questions and responses thereto, a decision tree being adapted to provide purposive training in view of second phase speech recognition. An example for a decision tree is a job interview. A user is e.g. asked (general) questions relating to various aspects of the job and towards the user's background. An example may relate to a route to be followed, e.g. towards a museum in a city. In general the decision tree may relate to a Quest. As such a user "moves" through a decision tree and progress of a user can be monitored. The interaction becomes much more vivid.

The present invention relates in a second aspect to a method for automatic real time user authentication according to claim 10.

In an example of the present method further a normalized score of authentications may be provided, such as for monitoring and evaluating.

In an example the present method provides monitoring of scores of users and of the relation between one or more users in a sequence of users.

In an example the present technology is used in assessment, in serious gaming, and for ranking.

EXAMPLES

The invention is further detailed by the accompanying example, which is exemplary and explanatory in nature and does not limit the scope of the invention. To the person skilled in the art it may be clear that many variants, being obvious or not, may be conceivable falling within the scope of protection, defined by the present claims.

SUMMARY OF FIGURES

Figure 1 shows an example of a functional flow diagram of the present system.

DETAILED DESCRIPTION OF THE FIGURES

Figure 1 shows an example of a functional flow diagram of the present system. Therein two phases can be identified: (1) An enrolment phase, and (2) an assessment phase.

In the enrolment phase, a photo (ID) is scanned and a video is recorded. During video recording a user reads a text presented on a screen. Both the photo and the video recording are stored in separate databases. From the video recording a photo and an audio fragment may be extracted, respectively.

During the assessment phase, assessment items (exercises) are retrieved from a database, the items typically requiring a response (answer, repetition, etc.) from a user. A user may record his/her responses through a microphone. The responses as well as the audio input are stored. For every item an item score is calculated and stored automatically, typically in a separate (scoring) database. Furthermore, the audio recordings are stored in a (further and separate) database. When the assessment is finished, an overall assessment score (AS) is calculated.

Based on three data sources obtained so far, namely data source 1 (D1) the scanned photo (enrolment), source 2 (D2) the recorded video (enrolment), and source 3 (D3) the recorded audio (assessment), a similarity score (SS) is calculated. Such an SS can be calculated by combining (1) the similarity score of D1 and extracted stills from D2 using face verification technology, and (2) the similarity score of D3 and extracted audio from D2 using speaker verification technology.
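The combination of the two partial similarity scores may, for instance, be sketched as a weighted mean. This is an assumption: the description does not state how the scores are combined, nor how face or speaker verification represents its input. The cosine similarity between feature vectors (embeddings) is used below only as a placeholder for the output of such verification technology, and all names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no information in an all-zero vector
    return dot / (norm_a * norm_b)


def combined_similarity(face_similarity, voice_similarity, face_weight=0.5):
    """Similarity score SS as a weighted mean (assumed combination rule).

    face_similarity:  D1 (scanned photo) versus stills extracted from D2.
    voice_similarity: D3 (assessment audio) versus audio extracted from D2.
    """
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be in [0, 1]")
    return face_weight * face_similarity + (1.0 - face_weight) * voice_similarity
```

With equal weights, a face similarity of 0.8 and a voice similarity of 0.6 give an SS of about 0.7; other combination rules, such as taking the minimum of the two scores, are equally conceivable.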

The SS and the AS are linked and stored in a database. Results are then sorted on AS in descending order. Further, users with an SS below a certain threshold can be indicated. Also potential intruders, that is, users not identified before, are indicated. Any such user is double checked, such as manually by inspecting and comparing D1, D2 and D3.
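The sorting and indicating step may be sketched as follows. The record layout, the field names and the threshold value are hypothetical and serve only to make the step concrete.

```python
from dataclasses import dataclass


@dataclass
class Result:
    user_id: str
    assessment_score: float   # AS
    similarity_score: float   # SS


def rank_and_flag(results, ss_threshold):
    """Sort results on AS in descending order and list the users whose
    SS lies below the threshold, as candidates for a manual check."""
    ranked = sorted(results, key=lambda r: r.assessment_score, reverse=True)
    flagged = [r.user_id for r in ranked if r.similarity_score < ss_threshold]
    return ranked, flagged
```

A user with a high AS but a low SS, for example, appears at the top of the ranking and in the flagged list at the same time, which is the case in which inspecting D1, D2 and D3 is most relevant.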

Claims

An automated audio-optical system for user identity authentication comprising a) at least one means for receiving audio input, such as a microphone, and transferring the audio input into an electrical audio signal, b) at least one means for receiving optical input, such as a camera, and transferring the optical input into an electrical optical signal, the input being a combination of audio and optical input, such as video input, and on-line input, c) a processor for collecting and processing audio and optical input and for providing output, preferably a digital signal processor, such as a computer, the processor comprising d) an audio signal processing unit for providing an audio characterization comprising i) automatic speech recognition, ii) one or more audio analyzers, such as an audio spectrum analyzer, a bandwidth analyzer, a sound analyzer, and a power analyzer, iii) optionally an error detector, and iv) a data storage means for storing information from a person, and e) at least one of an optical signal processing unit and an identification code collector for providing an optical characterization, each comprising a biometric face analyzer for identifying a person, the biometric face analyzer comprising a set of parameters for characterizing facial features, such as eyes, nose, lips, hair, ears, landmarks, thresholds for these parameters, and optionally a face database, a data storage means for storing information from a person, f) an authentication comparator for mutually comparing and authenticating audio input and / or optical input in a first situation, with audio input and / or optical input respectively in a second situation, and / or input of an identity code collector, wherein the authentication comparator preferably comprises a pronunciation score, and g) at least one means for providing output to the user, such as a speaker for providing audio feedback and a monitor for providing visual feedback.

An automated system according to claim 1, wherein personal characteristics of a user are stored in a first separate domain, and / or wherein the authentication comparator is stored in a separate second domain, and / or wherein audio and optical information is stored in a third separate domain, where the first, second and third domains can be linked with a secret key, the secret key being stored in a fourth domain, the fourth domain preferably being accessible only by an administrator after entering a user name and a code.

An automated system according to claim 1 or 2, wherein the authentication comparator further comprises an audio subtractor for subtracting a (part of a) first audio input from a (part of a) second audio input, an optical subtractor for subtracting a (part of a) first optical input from a (part of a) second optical input, and for subtracting a (part of a) first or second optical input from a (part of an) identity code input.

A system according to any one of the preceding claims, further comprising one or more of a pronunciation subtractor, a language proficiency subtractor, a communication proficiency subtractor, a response correctness subtractor, a user proficiency subtractor, a job experience subtractor, a training subtractor, and a training level subtractor.

A system according to any one of the preceding claims, further comprising one or more of a reference set of parameters, a fine-tuning mechanism, a self-learning algorithm, a self-correcting algorithm, a selection means for selecting criteria, a database, in which data are stored for one or more of pronunciation, stress, intonation, and phoneme segmentation.

A system according to any preceding claim, further comprising one or more decision trees, such as a decision tree adapted to provide questions and responses thereto.

A method, using a system according to any of claims 1-6, for automatic real-time user authentication comprising one or more of i) verification of oral language proficiency, ii) verification of a user's identity, iii) verification of voice characteristics of a user, iv) verification of a user's facial features, v) authentication of a user's communication skills, vi) verification of a user's intellect, vii) verification of a user's character, and viii) authentication of a user's motivation.

The method of claim 7, further comprising providing a normalized score of authentications.

The method of any one of claims 7-8, further comprising monitoring user scores and the relationship between one or more users in a series of users.

10. Method according to one of claims 7-9, for use in assessment, in serious gaming, and for ranking.