WO2007128975A2 - Biometric security systems - Google Patents


Info

Publication number
WO2007128975A2
Authority
WO
WIPO (PCT)
Prior art keywords
values
login
user
enrolment
login attempt
Prior art date
Application number
PCT/GB2007/001274
Other languages
French (fr)
Other versions
WO2007128975A3 (en)
Inventor
Kenneth Revett
Original Assignee
University Of Westminster
Priority date
Filing date
Publication date
Application filed by University Of Westminster filed Critical University Of Westminster
Publication of WO2007128975A2 publication Critical patent/WO2007128975A2/en
Publication of WO2007128975A3 publication Critical patent/WO2007128975A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/316User authentication by observing the pattern of computer usage, e.g. typical user behaviour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions

  • the present invention relates to biometric security systems for authenticating the user of a keyboard, key pad, and like key devices for entering characters into a data processing device
  • biometric software systems are known that enhance the security level of C2 based computer and related systems (e.g. keypad devices at ATMs).
  • Class C2 is a security rating established by the U.S. National Computer Security Center (NCSC).
  • C2 based security systems are the most ubiquitous class of secured computer systems - which rely solely on a user ID and password for access security.
  • the underlying principle of behavioural biometric software systems is that the way the user enters his login details (login ID and Password) contains valuable information regarding the actual identity of the individual possessing these login details.
  • Authentication is a 2-step process: firstly, the authentication details required to log in to a trusted computer system (or any system that provides electronic entry) must be produced; secondly, it must be established that the possessor of said authentication details is the authentic owner.
  • FAR false acceptance rate
  • FRR false rejection rate
  • EER equal error rate
  • CER cross over error rate
  • Biopassword is based on US Patent 4,805,222, which discloses verifying an individual's identity based on his keystroke dynamics, principally key pressure and time periods between keystrokes (termed digraphs). A representation of the keystroke dynamics of an individual is stored, and compared with the representation of the dynamics of a person seeking access.
  • the product Psylock is based on European Patents EP 0917678 and EP 1026570, which disclose comparison of a reference vector representing keyboard dynamics with a vector generated when a user seeks access. This requires an enrolment process in which a lengthy string of characters (at least 400, termed a passphrase) is entered. The comparison is based on an algorithm which is independent of the sequence in which characters are entered.
  • US Patent No. 6,442,692 discloses an authentication system having a microcontroller embedded in a keyboard and which assesses keyboard dynamics, which are independent of typing text.
  • U.S. Patent No. 6,901,145 discloses a keystroke feature authentication system, wherein a share table contains a row for each keystroke feature. The parameter is compared with a threshold and the results are stored. A history of legitimate authentications is kept. If a keystroke feature consistently produces the same comparison result, then it is considered a distinguishing feature.
  • these papers disclose an authentication system wherein a profile or reference vector summarising keystroke dynamics is generated during an enrolment process (whether via the keyboard or through a graphical mouse driven interface), for use in a verification procedure for a subsequent login ID/Password entry.
  • the verification is based on statistical measures, and a primary parameter generated is coefficient of variation of latency times between digraphs. Evolving typing styles of users over time is taken into account.
  • the present invention is comprised in a biometric security system that has two main operating processes: a first enrolment process in which user data is acquired from features of a user's operation of a keyboard, and a second login process where the user is authenticated by a comparison process of the user data acquired from enrolment with features of the user's login attempt.
  • a convenient measurement value that may be adopted for assessing characteristic features of a user's keyboard dynamics is digraph or trigraph latencies, by which is meant a time interval representative of the sequential pressing of two or three keys respectively.
  • Other values that may be employed are dwell time for which a key remains depressed, scan codes (ASCII codes) which are assigned to all characters and symbols on each key of the keyboard, pressure upon a key when depressed, and other measures that are explained below or that will be apparent to the person skilled in the art.
  • statistical estimates are generated from recorded measurement values, and these may be employed for a base level assessment of a login attempt, which is a highly secure assessment.
  • estimates may be generated based on an entropy function, representing the degree of disorder or variability in a user's typing style.
  • There are various functions that can provide such a measure and we may employ a measure from information theory that represents the information capacity within an action, and is known as Shannon's entropy: this is explained in more detail below.
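As a minimal sketch of how a Shannon entropy measure can be computed over a user's typing (the binning of latencies into discrete symbols and the function name are illustrative, not taken from this document):

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols,
    e.g. digraph latencies binned into letters."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A perfectly regular typist (every latency in the same bin) yields zero entropy; maximal variability approaches log2 of the alphabet size.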
  • Other measures may be employed, for example based on the concept of an edit distance, as explained below.
  • the entropy value may be employed directly or as an adjunctive measure to the statistical estimates, and gives additional flexibility in the assessment process of a login attempt.
  • one or more machine learning algorithmic techniques may be employed to assess the data.
  • Machine learning is a broad area of artificial intelligence concerned with the development of techniques that allow computers to "learn" how to perform specific tasks such as classification in this case.
  • the invention may comprise a hierarchical authentication process that works at a plurality of levels.
  • An issue with assessing a login attempt using only statistical measures is that, whilst it gives a highly secure authentication, it may result in rejection of an authentic user, because one or more unexpected significant variations in his login attempt may cause threshold values to be exceeded. Even an assessment based on entropy functions may not be able to deal with this issue.
  • an advantage of a machine learning technique for assessment is that it can deal with major variations from the expected characteristics of an authentic user's login attempt, without causing rejection, because the assessment is based more on a collection of individual features that together characterise the user, rather than measures based on an overall assessment of the user's technique.
  • the login attempt may be accepted if other characteristic features are present.
  • Such a technique therefore reduces the risk of false rejection, and hence lowers the FRR.
  • Neural networks may desirably be employed in the machine learning assessment.
  • a Learning Vector Quantisation (LVQ) process is employed where respective clusters of failed and accepted login attempts are collected, and for a new login attempt, the Euclidean distances to the two clusters are computed. The cluster that the login attempt is closest to is deemed the winning cluster, and the associated label of that cluster forms the decision class output.
  • This technique is computationally inexpensive, requires only a quick training period, and is highly accurate in most applications.
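The winning-cluster decision described above can be sketched as a nearest-prototype comparison; this is a simplification (a full LVQ implementation also iteratively trains the prototype positions), and the function names are illustrative rather than taken from the patent:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors
    (e.g. digraph latencies of a login attempt)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lvq_classify(attempt, accepted_prototype, failed_prototype):
    """Return the label of the closest cluster prototype:
    the cluster the login attempt is nearest to wins."""
    d_ok = euclidean(attempt, accepted_prototype)
    d_bad = euclidean(attempt, failed_prototype)
    return "accepted" if d_ok <= d_bad else "failed"
```

The decision step itself is just two distance computations and a comparison, which is why the technique is computationally inexpensive.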
  • a disadvantage of LVQ is that it cannot be deployed immediately after enrolment of a user, because it requires a collection of invalid login attempts, in addition to the original enrolment process, so that the algorithm may be "trained" to categorise attempts as either failed or accepted.
  • such an algorithm, requiring a training process, is termed a supervised algorithm.
  • a further machine learning algorithm is employed that is unsupervised, and does not need to be trained on examples of failed login attempts.
  • algorithms such as clustering algorithms or self-organising maps may be employed; desirably, in accordance with the invention, an algorithm from the area known as bioinformatics may be employed.
  • an Artificial Immune System (AIS) algorithm may be employed, as described in more detail below, that does not require examples of failed login attempts.
  • AIS computer algorithms are inspired by the principles and processes of the vertebrate immune system. The algorithms typically exploit the immune system's characteristics of learning and memory to solve a problem.
  • An AIS algorithm may accept as input entropy values determined from the login attempt.
  • the AIS algorithm may accept as inputs data representing bioinformatics motif patterns, representative of features of the user's keyboard style.
  • a bioinformatics process may be employed to generate entropy values.
  • a Power Law of Practise (PLP) step may be deployed, to take account of the evolution of typing style with repeated logins.
  • PLP reflects the reduction in performance time resulting from practise.
  • we plot the log of the trial (task repetition) number versus the log of the performance time. The slope of this function provides a quantitative measure of the performance enhancement produced by practise. This measure is used for adjusting thresholds in the assessments of a login attempt, and may be employed as a separate assessment step.
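The log-log slope just described can be estimated by an ordinary least-squares fit (the function name is ours); for data following an exact power law t = c * n^b the fit recovers the exponent b, which is negative when practice reduces performance time:

```python
import math

def plp_slope(times):
    """Least-squares slope of log(performance time) against
    log(trial number), i.e. the Power Law of Practise exponent."""
    xs = [math.log(i + 1) for i in range(len(times))]
    ys = [math.log(t) for t in times]
    n = len(times)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```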
  • a multitiered process may be employed to assess a login attempt. Initially an AIS process is deployed to check for characteristic features of a user's keyboard style. An LVQ process may be deployed after repeated login attempts to give greater security.
  • the invention provides a method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string during the login procedure, the method comprising: an enrolment process wherein a set of data is determined representing characteristics of key dynamics of the user, said characteristics comprising at least one of digraphs, trigraphs, dwell times and scan codes; and assessing the key dynamics of a user's login attempt by comparison with said set of data, comprising providing at least the following processes: a) a process comprising a statistical estimate of a user's login attempt; b) a process comprising an entropy estimate based on the degree of disorder in the login attempt; and c) a process comprising at least one machine learning technique where an assessment is made as to whether characteristic features of a collection of characteristic features are present in the login attempt; and, depending on the level of security required, deploying any or all of processes a), b) and c) for assessing a login attempt.
  • the invention provides a method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the method comprising: an enrolment process wherein user data is determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user, wherein in an initial step of the enrolment process, selected values of the user's keyboard entry are assigned to intervals in a range of values, and each interval is assigned a denoting symbol, and wherein repetitive motif patterns in said denoting symbols are determined by a multiple sequence alignment process; and assessing the key dynamics of a user's login attempt by comparing characteristics of the login attempt with said repetitive motif patterns.
  • the invention provides a method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the method comprising: an enrolment process wherein user data is determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user; providing an Artificial Immune System process, and generating strings of data items, each string representing a sequence of characteristic features found in the dynamics of an enrolment login attempt, each string being denoted as a collection of one or more MHC molecule(s), and generating at least one antibody from a sequence of at least one MHC molecule, said antibody comprising a string of numerical values related to the numerical values of the or each MHC molecule; and assessing the key dynamics of a user's login attempt by locating MHC molecules therewithin, and applying said antibody to the located MHC molecules to assess a level of correlation therebetween.
  • the invention provides apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the apparatus comprising: an enrolment means storing a set of data representing characteristics of key dynamics of the user, said characteristics comprising at least one of digraphs, trigraphs and dwell times; means for assessing the key dynamics of a user's login attempt by comparison with said set of data, comprising providing at least the following assessment means: a) an assessment means comprising a statistical means for estimating a user's login attempt; b) an assessment means comprising an entropy means for estimating a login attempt based on the degree of disorder in the login attempt; and c) at least one machine learning means for assessing whether characteristic features of a collection of characteristic features are present in the login attempt; and means for deploying any or all of means a), b) and c) for assessing a login attempt, depending on the level of security required.
  • the invention provides apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a character string as a login procedure, the apparatus comprising: an enrolment means storing a set of data representing characteristics of key dynamics of the user; said user data including selected values of the user's keyboard entry assigned to intervals in a range of values, and each interval is assigned a denoting symbol, and repetitive motif patterns determined in said denoting symbols by a multiple sequence alignment process; and means for assessing the key dynamics of a user's login attempt by comparing characteristics of the login attempt with said repetitive motif patterns.
  • the invention provides apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the apparatus comprising: an enrolment means storing user data determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user; an Artificial Immune System means for generating strings of data items, each string representing a sequence of characteristic features found in the dynamics of an enrolment login attempt, each string being denoted as an MHC molecule, and means for generating at least one antibody from a sequence of at least one MHC molecule extending over at least part of an enrolment login attempt, said antibody comprising a string of numerical values related to the numerical values of the or each MHC molecule; and means for assessing the key dynamics of a user's login attempt by locating MHC molecules therewithin, and applying said antibody to the located MHC molecules to assess a level of correlation therebetween.
  • the system maintains an input history for each login ID/password combination that is updated after each successful login attempt. All required measurements, such as variances, CV, and entropy, are updated to reflect the current input typing style of the authentic owner.
  • Figure 1 is a diagrammatic representation of FAR, FRR and EER;
  • Figure 2 is a schematic representation of digraph and trigraph latencies;
  • Figure 3 is a schematic view of a partitioned keyboard for maximal dispersion, shown as an example divided into four regions as indicated by the shades of grey;
  • Figure 4 depicts diagrammatically an antibody binding with an antigen;
  • the MHC molecules in this algorithm correspond to regions of the input string that are similar across all or the majority of the enrolment samples.
  • the antibodies are tuned to match the resulting MHC molecule(s) that are specific to a given set of MHC molecule(s) per enrolment string.
  • Figure 5 depicts the concept of the edit distance as employed in this invention. This particular example yields an edit distance of 6 - which represents the number of changes required to perfectly align the two patterns;
  • Figure 6 depicts a table of digraphs collected during a 10 entry enrolment transformed into the amino acid alphabet for a given login ID: qxtjfa and password atbeus. Each column represents the same digraph and each row is a complete entry;
  • Figure 7 depicts the result of the multiple sequence alignment for a given login ID/password sequence, for login ID: qxtjfa and password: atbeus.
  • the '-' symbols indicate no-match situations.
  • the first string yields 3 separate MHC molecules and the second 2 unique MHC molecules.
  • Figure 8 is a flow chart of deployment of a series of algorithms in accordance with the invention.
  • a 'No' result indicates a possible point of authentication failure.
  • the present invention has as its primary aim the task of enhancing C2 based security - through various machine learning techniques and/or statistical methods.
  • the primary task is to verify that the possessor of a login ID/password is the rightful owner of said details. This is accomplished through a series of verification procedures which are enabled at the login site.
  • a login site could be a PC remotely accessing a web based service, a PC in a networked office/business environment, any keypad based installation, mobile devices, or personal digital assistants.
  • the only constraint with regard to the application is that the device should have sufficient memory to store, and a processor to execute, the programme.
  • the present invention provides a multi-layered authentication process
  • a combination of novel machine learning approaches and a multi-layered authentication approach ensures a high level of security in a software only biometrics system.
  • the preferred embodiment of the invention extracts information from the users when they enrol into the system.
  • the login ID and password have a minimum length of 6 characters and symbols, with no upper bound.
  • This system could also be employed where a passphrase is used.
  • a passphrase is simply a collection of characters and symbols that form a particular phrase that can be used to validate access to a trusted computer system (such as a C2 based security system). Provided there were sufficient numbers of characters in the phrase to generate repeated digraphs/trigraphs, a passphrase would be adequate for enrolment.
  • the default version employs login ID/password for enrolment.
  • Each user is required to enter a login ID/password combination during enrolment (a default value of 10 times) until a combination of the following criteria have been met: i) entropy threshold and ii) consistency threshold.
  • by entropy threshold we mean that the variability between keystroke entries (as measured by variance in digraph/trigraph times and duration for all keystrokes used in forming the login ID/password) is sufficient (i.e. is above a threshold value). This measure is captured through the CV for the digraphs/trigraphs that were extracted from the particular characters forming the user's login ID/password.
  • a default level of variability of 0.20 is employed (as measured by the coefficient of variation, the standard deviation normalised by the mean value), and is adjustable: we may raise or lower this threshold as circumstances dictate.
  • a consistency threshold is used to ensure that the entries of the login ID/password (with respect to digraph/trigraph/duration times) are consistent during the enrolment process.
  • the consistency is calculated by measuring the CV between successive entries, and a running average is maintained. If the CV is too large, then that entry is rejected and the user is respectfully requested to repeat that entry. Again, this threshold is a tunable parameter: the lower the value, the more robust this feature becomes.
  • ensuring that the login ID/password details contain a minimal level of uniqueness helps ensure that the login details are difficult to re-create by unauthorised users.
  • this value can be reduced to ensure that the login data during enrolment is as consistent and reproducible as possible, to an almost arbitrary level of stringency. If a user is not able to enter their login ID/password details with sufficient consistency, then they will be requested to re-enrol with a different login ID/password. If this situation remains, then the enrolment process will go to completion and these account details will be flagged as a potential risk to the system administrator, where possible. Based on the dynamics of how the user enters their login details, a measure of their natural variation is captured through these statistical measures.
  • the calculated entropy of this system is based on the following: for the initial enrollment entry (the first entry), the CV is calculated for each of the digraphs/trigraphs.
  • the CV is the standard deviation normalised by the mean.
  • the standard deviation is calculated for the digraphs/trigraphs, each done separately. For example, using digraphs only, we calculate the sum of all digraphs for the login ID and password separately. We derive the mean, variance, and standard deviation, then derive the CV by dividing the standard deviation by the mean. This is repeated for login ID and password separately, and for digraphs and trigraphs separately. If, for the login ID and password, the digraph/trigraph CV values are less than a given threshold, the entry is rejected.
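The CV computation just described can be written directly; this is a sketch, and since the text does not say whether sample or population variance is intended, the population variance is used here:

```python
import math

def coefficient_of_variation(values):
    """CV: the standard deviation normalised by the mean, computed
    here from the population variance of the latency values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance) / mean
```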
  • the user will be notified if there was a breach of the entropy measure and if so, whether it was for the login ID, password, or both.
  • the threshold is based on a constant factor times the number of characters in the login ID/password - since we have no data at the first entry that can be used to make a personal determination.
  • the use of entropy in the enrollment process is an inherent measure of stability in the enrollment process.
  • the login ID/password is rejected.
  • the system requires that the user select a login ID/password that has a variety of characters in it to ensure that the keyboard is well sampled.
  • the system also requires that users select characters that are maximally dispersed across the keyboard, implementing what is termed 'keyboard partitioning'.
  • This feature is implemented in high level security installations only.
  • We maintain an active dictionary attack database so the information contained is current.
  • the use of the facility is for strictly high level security applications, as it requires the user to select IDs that are more difficult to enter.
  • These attributes form a reference vector that is stored for each user.
  • the reference vector is updated with the new details - thereby allowing the reference vector to adjust to changes in the users' typing style.
  • the adjustment is performed as a weighted average, which depends on how far away the current login details are from the running average. This allows the adjustment to more accurately estimate the variation/change in typing style of the user. From these attributes, we generate a set of algorithms to authenticate a login ID/password on a 1:1 basis; that is, the current input details are compared with the stored reference vector for the given login ID/password.
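The exact weighting formula is not given in this text, so the following is only one plausible sketch of a distance-dependent weighted-average update, in which an entry far from the running mean (relative to that mean) receives less weight, damping outliers while still tracking gradual drift; the base rate parameter is hypothetical:

```python
def update_reference(ref_mean, new_value, base_alpha=0.2):
    """Blend a new login measurement into the running reference mean.
    The effective learning rate shrinks as the relative distance from
    the running mean grows (hypothetical weighting scheme)."""
    rel_dist = abs(new_value - ref_mean) / ref_mean
    alpha = base_alpha / (1.0 + rel_dist)
    return (1.0 - alpha) * ref_mean + alpha * new_value
```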
  • as shown in Figure 2, there are various time intervals that may characterise a digraph.
  • Figure 2 indicates depression of a key "N" at time T1, release of the key at time T3, depression of a key "O" at time T2, and release of that key at time T4. Any or all of the time intervals between the pairs (T1, T3) and (T2, T4) may be used to represent the digraph, whereas the intervals T1-T3 and T2-T4 represent the durations for which the keys remain depressed.
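Under the Figure 2 notation, dwell times and digraph latencies can be extracted from (key, press time, release time) events; this sketch takes the press-to-press interval as the digraph latency, which is just one of the admissible interval choices described above:

```python
def keystroke_features(events):
    """Given a list of (key, press, release) tuples in typing order,
    return per-key dwell times (press to release) and press-to-press
    digraph latencies between consecutive keys."""
    dwells = [release - press for _key, press, release in events]
    digraphs = [events[i + 1][1] - events[i][1]
                for i in range(len(events) - 1)]
    return dwells, digraphs
```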
  • a fundamental authentication algorithm employed in our hierarchical authentication algorithm employs a statistical measure of the distance between a given login ID/password entry and those stored as a reference value for that login ID/password - generated through the enrolment process.
  • a scan code is the ASCII value assigned to every key on a standard PC keyboard, which includes not only alphanumeric characters, but also punctuation marks and all other symbols. Other devices will have a similar labelling system that we would employ when applicable.
  • the algorithm works by determining the CV for all digraphs/trigraphs/durations and related measures in the reference vector.
  • a critical element in this process is to gather the average values for all of the elements contained in the reference vector.
  • the terms of equation 1 are as follows: N is a multiplicative factor, set to 1.0 for default stringency; the value for a particular digraph/trigraph is stored in the reference vector; CV is the reference value for that digraph/trigraph generated during enrolment; and 'X' refers to the value we are measuring, generated from a given login attempt.
  • a stringency factor (N in eqn 1) is directly related to these measures and is embodied as a multiplicative term. The stringency measure is employed during authentication at the lowest level of the hierarchy. This layer can be thought of as the lowest level of authentication, used when all other measures provide inconclusive data regarding the authenticity of the current login attempt.
  • the stringency threshold, as well as all others employed in this system, are selected based on the values for these statistical measures captured during enrolment, and are updated during usage. For variations in the stringency level, a multiplicative term is applied to these statistics to derive the final thresholds for acceptance. Values for the multiplicative term (N in eqn 1) are derived from the range of variation generated during the enrolment process. The final result is an average value for the statistical measures, modulated by the range of values generated during enrolment, and hence tailored to each particular user of the system. Also note that as the system maintains historical data, these measurements are updated over time and reflect the most current status of the user with respect to their input style. This allows a more realistic measure of 'high' stringency.
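Equation 1 itself is not reproduced in this extract, so the following is a hedged reconstruction of an eqn-1 style acceptance test from the terms described (N, CV, the stored reference value, and the measured value X); the exact form used in the original may differ:

```python
def within_stringency(x, ref_value, cv, n_factor=1.0):
    """Accept the measured value x if it lies within N * CV * reference
    of the stored reference value (reconstructed acceptance band;
    N = 1.0 corresponds to the default stringency)."""
    return abs(x - ref_value) <= n_factor * cv * ref_value
```

Raising N widens the acceptance band (lower stringency); lowering it narrows the band (higher stringency).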
  • an example of the edit distance algorithm is taken from a 6 character login ID.
  • the arrows indicate where the 2 elements in the string match, and the number of displacements required to match all the symbols in the 2 strings represents the edit or Levenshtein distance.
  • the edit distance is 6 - which represents the number of changes required to perfectly align the two patterns. Note that this value is divided by the total number of shuffles (16) to generate the actual score (0.375).
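The edit (Levenshtein) distance can be computed with the standard dynamic-programming recurrence, sketched below; note that the normalisation by the total number of shuffles (16 in the example above) follows this document's own alignment procedure and is not part of the classic algorithm:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]
```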
  • the next level of our hierarchical approach employs algorithms of the type employed in machine learning techniques known as bioinformatics, or biologically-inspired computing.
  • the next stage entails converting our array of floating point numbers resulting from measurements of digraph/trigraph/duration entries into a discrete alphabet.
  • a discrete alphabet such as the amino acid sequence in bioinformatics, which has a 20-character alphabet, representing the 20 amino acids commonly found in living organisms.
  • we find the maximum digraph and trigraph values (we do not use duration for this calculation).
  • each digraph/trigraph value is assigned the amino acid letter that corresponds to the appropriate interval. For instance, if the maximum value of the digraphs for enrolment was 0.70, then we would have 20 intervals of 0.035 each (corresponding to a range of 35 msec for that particular attribute).
  • the amino acid alphabet is sorted in ascending order, yielding the following sequence: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y.
  • a digraph value of 0.000 - 0.035 would be given the symbol "A", and 0.060 would be given the symbol "C", etc.
  • Factors that affect the granularity of these intervals are the size of the alphabet and the maximum attribute measure. From our experience, the maximal value is usually less than 1.5 seconds, and more typically under 1 second, but this is user dependent. In cases where we find it necessary, we can switch to an alphabet with a larger or smaller cardinality, such as the English alphabet with 26 elements or the genetic code alphabet with 4 elements. The lower the maximal value, the more consistently the user types, and hence the finer the granularity, and vice versa.
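The interval mapping described above can be sketched as follows; with the 20-letter amino acid alphabet in ascending order and a maximum digraph value of 0.70 this reproduces the worked example, mapping 0.060 to "C":

```python
# the 20 amino acid one-letter codes in ascending order
AMINO_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def discretise(value, max_value, alphabet=AMINO_ALPHABET):
    """Map a digraph/trigraph latency onto a discrete alphabet by
    cutting the range [0, max_value] into equal-width intervals
    (0.035 s each for a 0.70 s maximum and a 20-letter alphabet)."""
    width = max_value / len(alphabet)
    index = min(int(value / width), len(alphabet) - 1)  # clamp top value
    return alphabet[index]
```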
  • a collection of digraphs collected during a 10 entry enrolment and transformed into the amino acid alphabet for a given login ID: qxtjfa and password atbeus, are arranged in a table or matrix. Each column represents the same digraph in time sequential order and each row is a complete entry.
  • if the motifs do exist in the input sequence, then it is highly likely that the input was produced by the authentic user.
  • This level of analysis provides us with information on regional consistencies within the input - which indicate stability in typing style of the user.
  • the pure digraph approach looks at the input from a very local level - requiring that all digraphs are matched between login attempt and stored reference values. One could choose to select a subset of the attributes to match with the reference vector.
  • the multiple sequence alignment approach is a more objective and automated basis for deciding which attributes are important in classifying the input based on the stability of specific attributes or combinations thereof. Our results indicate that this approach has a significant impact on reducing the FRR compared to the exhaustive attribute match approach.
  • the multiple sequence alignment approach looks more globally, searching for regions of similarity, but not necessarily across the entire input range. (See Figure 6 for an example of a transformed enrolment entry). It therefore is more tolerant to noisy inputs, as might happen when a user logs in without being completely conversant with their login ID/password. This approach helps reduce the FRR without compromising the FAR results - when used in conjunction with the next level in the hierarchy.
  • an artificial immune system employs the biological immune system as a metaphor for various types of machine learning algorithms.
  • MHC: major histocompatibility complex
  • Abs: antibodies
  • the input sequence (the login ID/password represented in terms of the amino acid sequence as described above) serves as a potential antigen - an object that the immune system tries to remove if it is found to be foreign.
  • by foreign we mean it does not contain markings that indicate it belongs to the possessor of the responding immune system.
  • the potential antigen is the set of motifs discovered by the MSA algorithm. An entity is considered foreign if it does not possess those antigens.
  • the pattern recognition aspect of this algorithm is evident - if the input string (login ID/password) contains sufficient antigenic sites that are recognised by the MHC molecules, then the pattern is recognised as self - that is, it may be an authentic login attempt. Otherwise it is considered foreign and rejected.
  • the MHC molecules are derived from the multiple sequence alignment algorithm, acting on the enrolment values as indicated in Figure 6.
  • in Figure 7 we present a graphical depiction of what a collection of MHC molecules may look like for a given enrolment.
  • the data in Figure 7 are generated using multiple sequence alignment: the transformed enrolment sequences of Figure 6 (in the amino acid symbol set) are used to derive the frequencies of expected occurrence for each symbol within each column position. A multiple sequence alignment algorithm is then applied to generate the alignments, and we select those alignments that satisfy the following requirement: if a column is represented by one predominant symbol, then that symbol alone is used.
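A minimal sketch of this column-wise selection follows. The 80% predominance threshold is our assumption - the text says only that a column represented by one predominant symbol contributes that symbol - and the toy enrolment strings are hypothetical.

```python
from collections import Counter

def extract_mhc_motifs(entries, threshold=0.8):
    """For each column of the aligned, symbol-transformed enrolment
    entries, keep the predominant symbol if it accounts for at least
    `threshold` of the observations; such columns form the MHC motifs."""
    mhc = {}
    for col in range(len(entries[0])):
        counts = Counter(row[col] for row in entries)
        symbol, n = counts.most_common(1)[0]
        if n / len(entries) >= threshold:
            mhc[col] = symbol
    return mhc  # maps column position -> predominant symbol

# Four hypothetical transformed enrolment entries for one account:
entries = ["ACEGA", "ACEGC", "ACEHA", "ACEGA"]
print(extract_mhc_motifs(entries))  # → {0: 'A', 1: 'C', 2: 'E'}
```

Columns 3 and 4 are dropped here because their most frequent symbol appears in only 3 of 4 entries (75%), below the assumed threshold.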
  • when MHC molecules have found their corresponding antigen within the discretised login ID/password attempt - by which we mean the attempt contains the same set of symbols found in the MHC molecules - then the MHC molecules become activated.
  • Antibodies only react with activated MHC molecules. An antibody that becomes activated must bind to all of the MHC molecule binding sites it possesses. This in turn requires that all MHC molecules generated during the MSA must find a suitable match in the input string (in symbol sequence form). Only when all MHCs for a given antibody are activated will the antibody be activated. The algorithm determines if there is an activated antibody - if there is, then the input is considered authentic; otherwise it is considered a failed attempt. It will be understood that the antibodies represent numeric values that are the complement of the numeric values for the MHC values, in that the numeric values when added together are equal to 1, on a normalised scale of digraph times.
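The all-or-nothing activation rule described above can be illustrated as follows. This is a sketch under our reading of the text - the positions and symbols are hypothetical, and a match is taken to mean the same symbol at the same position.

```python
def antibody_activated(mhc_motifs, login_symbols):
    """An antibody is activated only if every MHC motif it carries
    finds its match (the same symbol at the same column position)
    in the discretised login attempt."""
    return all(
        pos < len(login_symbols) and login_symbols[pos] == sym
        for pos, sym in mhc_motifs.items()
    )

mhc = {0: "A", 1: "C", 2: "E"}                    # hypothetical motifs
print(antibody_activated(mhc, "ACEGA"))  # True: all motifs matched -> authentic
print(antibody_activated(mhc, "ADEGA"))  # False: position 1 mismatch -> rejected
```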
  • the LVQ algorithm is a supervised neural network classifier that requires at least 2 classes of training examples - in our case, one corresponding to authentic users and one corresponding to failed login attempts. To train the system, one must associate a decision label with every training example - either authentic (labelled 1) or failed attempts (labelled 0). These should form 2 well separated clusters of points, where each point is a multidimensional vector. This vector corresponds to the average values for each digraph/trigraph (in our case we find it sufficient to use only one of the two - we opted to use digraphs) generated during enrolment and any updates over time. This algorithm then uses a spatial metric (a Euclidean distance measure) to decide which class a login attempt belongs to.
  • the algorithm employs both a training and a testing phase.
  • we use a subset of members from each class to train the system (typically 70% of the total in each class, with an equal number of items from each class in the ideal case).
  • We look at the label associated with the particular example and the label of the point it was closest to in the clusters. Examples that are correctly classified are made closer to the training vectors and those that are incorrectly classified are moved further from the training examples.
  • after this process, which generally takes 50 iterations, we end up with two very distinct classes that are extremely well separated.
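The LVQ training and classification steps above can be sketched with a compact LVQ1 implementation. This is a generic illustration consistent with the description, not the patent's code: the learning rate, prototype initialisation, and two-dimensional toy data are all our assumptions.

```python
import math

def lvq1_train(X, y, prototypes, proto_labels, lr=0.1, iterations=50):
    """Minimal LVQ1 sketch: the nearest prototype is pulled toward a
    correctly labelled example and pushed away from a mislabelled one."""
    P = [list(p) for p in prototypes]
    for _ in range(iterations):
        for x, label in zip(X, y):
            dists = [math.dist(p, x) for p in P]   # Euclidean distances
            w = dists.index(min(dists))            # winning prototype
            sign = 1.0 if proto_labels[w] == label else -1.0
            P[w] = [pi + sign * lr * (xi - pi) for pi, xi in zip(P[w], x)]
    return P

def lvq_classify(x, prototypes, proto_labels):
    dists = [math.dist(p, x) for p in prototypes]
    return proto_labels[dists.index(min(dists))]

# Toy data: authentic (1) and failed (0) login vectors form two clusters.
X = [[0.10, 0.12], [0.12, 0.10], [0.90, 0.88], [0.88, 0.92]]
y = [1, 1, 0, 0]
P = lvq1_train(X, y, [[0.2, 0.2], [0.8, 0.8]], [1, 0])
print(lvq_classify([0.11, 0.11], P, [1, 0]))  # → 1 (authentic)
print(lvq_classify([0.90, 0.90], P, [1, 0]))  # → 0 (failed)
```

After the 50 training iterations the two prototypes have converged onto their clusters, so a new login vector is simply assigned the label of the nearest prototype.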
  • the incorporation of the LVQ algorithm enables us to use the speed and accuracy of machine learning techniques when we have sufficient data for failed login attempts.
  • the algorithm is in general more flexible than statistical methods and in many cases more robust.
  • the LVQ is a very fast algorithm which produces results comparable to other neural networks with a minimal amount of storage required to re-run the network on new samples.
  • the results from the LVQ are again used in conjunction with other results to form a multimodal decision (or a hierarchical network of classifiers) about whether to authenticate a given login attempt.
  • the digraphs, trigraphs, keypress durations, and scancode (the value associated with each character on the keyboard) are recorded using an assembly language routine. Then, the reference vector stored for this login ID/password is compared with the entry, and a series of authentication procedures are employed to determine in a quantitative and automated fashion whether the input was entered by the authentic owner of this login ID/password combination. This process is determined in a multi-step fashion.
  • the values of various thresholds can be modulated accordingly. For instance, in equation 1, the value of N - the stringency factor - can be increased in order to relax the consistency requirements of a user's login details. This allows for a more flexible set of digraph/trigraph entries. This would allow users with variable input patterns to be rejected by the system less often (reduced FRR) - at the expense of increasing the FAR slightly. The same holds true for other parameters in the model - such as entropy thresholds. The system ensures dynamic flexibility by setting critical parameter values to be just above those measurements obtained during the enrolment process and modified through the course of a user's input history.
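Equation 1 itself is not reproduced in this excerpt, so the following is a hedged sketch of a stringency-controlled consistency test of a common form: a login digraph passes if it lies within N standard deviations of the enrolment mean, so increasing N relaxes the test (lower FRR, slightly higher FAR), as described above.

```python
# Assumed form only - the patent's equation 1 is not shown in this text.
def digraph_passes(value, enrol_mean, enrol_std, N=2.0):
    """Accept the observed digraph latency if it is within N standard
    deviations of the enrolment mean; N is the stringency factor."""
    return abs(value - enrol_mean) <= N * enrol_std

print(digraph_passes(0.21, 0.18, 0.02, N=1.0))  # False: more than 1 sigma away
print(digraph_passes(0.21, 0.18, 0.02, N=2.0))  # True: within 2 sigma
```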
  • the LVQ algorithm is not employed at this stage, as it assumes that there are at least 50% as many non-authenticated login attempts as authenticated attempts. Entropy measures are recorded - both with respect to edit distance and the information-theoretic entropy.
  • the artificial immune system (AIS) is then employed to determine if this attempt is sufficiently close to those found in enrollment - either the naïve enrollment (where only enrollment attempts have been entered) or where there has been a historical update of the login attempts. The latter case indicates that as a new valid entry is created, the oldest entry is replaced with the new entry to maintain a maximum of 10 entries for this login ID/password combination - and the relevant statistical measures are updated accordingly.
  • the AIS algorithm selects features from the account data that are uniquely consistent - starting with single digraphs/trigraphs - looking for elements such as very short or very long durations, consistency between successive timing elements etc. These elements describe what the self is - that these characteristics belong to the authentic user.
  • when a login attempt is made, the antibodies that react with unique features are released on the input, and a recording is made as to how much of the input is matched by pairing in a complementary fashion with the antibodies.
  • These antibodies react with a pool of MHC molecules raised during the enrolment process. This process requires that we convert the input attributes into a discrete symbolic alphabet. In the default case, we use the amino acid alphabet, which provides 20 discrete ranges that the input parameters are mapped into.
  • the edit distance is calculated based on the number of sorts required to match the input sequence with the average reference value for that login ID. This is divided by the total number of swaps possible with an input sequence of a given length. If the value is greater than a given threshold (set to the average value of the enrolment sequence), then this input is marked as rejected by this algorithm; otherwise it is accepted. Also, the information-theoretic entropy is measured, as per equation 3. If the inherent entropy of the system differs significantly from any particular element in the enrolment sequence or the average value, then this algorithm votes to reject this input.
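The text describes the edit distance as the number of sorts needed, divided by the total number of possible swaps. One plausible reconstruction - our reading, not the patent's code - counts pairwise inversions against the reference ordering and normalises by n(n-1)/2; the digraph names below are hypothetical.

```python
def normalised_edit_distance(input_order, reference_order):
    """Count the pairwise swaps (inversions) needed to reorder the
    login attempt's digraphs into the reference ordering, then divide
    by the maximum possible n(n-1)/2 swaps to normalise to [0, 1]."""
    rank = {item: i for i, item in enumerate(reference_order)}
    seq = [rank[item] for item in input_order]
    n = len(seq)
    swaps = sum(1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j])
    return swaps / (n * (n - 1) // 2)

# Digraphs of a hypothetical login, ordered by speed, versus the
# reference ordering obtained at enrolment:
print(normalised_edit_distance(["th", "he", "qu"], ["qu", "th", "he"]))  # ≈ 0.667
```

A value of 0 means the login's digraph ordering exactly matches the reference; values near 1 indicate a completely inverted ordering and trigger rejection when above the threshold.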
  • the system as tested to date has a very low FAR (0.01%) and low FRR (0.00%) when tested in a heterogeneous group of subjects.
  • LVQ: Learning Vector Quantisation

Abstract

A method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a user identity and password for access security, the method incorporating a multi-tiered approach wherein various processes may be employed selectively in sequence to provide a desired degree of security, while controlling false acceptance/rejection rates. Processes which may be employed are as follows: statistical measures based on coefficients of variation, and entropy measures representing the degree of disorder in login attempts; a multiple sequence alignment based process for extracting unique features within the keystroke dynamics of the user's enrollment data; an unsupervised Artificial Immune System process for assessing the key dynamics of a user's login attempt by reacting antibody classifiers with selected key dynamics of the login attempt, and assessing a level of correlation representing reactivity; a supervised Learning Vector Quantisation neural network classification process; and a Power Law of Practise process that takes account of changes in a user's typing style.

Description

BIOMETRIC SECURITY SYSTEMS
The present invention relates to biometric security systems for authenticating the user of a keyboard, key pad, and like key devices for entering characters into a data processing device
Background Art
Behavioural biometric software systems are known that enhance the security level of C2 based computer and related systems (i.e. keypad devices at ATMs etc). Class C2 is a security rating established by the U.S. National Computer Security Center (NCSC). C2 based security systems are the most ubiquitous class of secured computer systems - they rely solely on a user ID and password for access security. The underlying principle of behavioural biometric software systems is that the way the user enters his login details (login ID and password) contains valuable information regarding the actual identity of the individual possessing these login details. Authentication is a 2-step process: firstly, the authentication details required to log in to a trusted computer system (or any system that provides electronic entry) must be produced and secondly, it must be established that the possessor of said authentication details is the authentic owner.
There are commercial products that address this problem. All of these products claim to enhance C2 security systems to varying degrees. The assessment of performance in the industry relies principally on two metrics: false acceptance rate (FAR) and false rejection rate (FRR). The FAR is measured as the number of successful attempts by illegitimate owners of the authentication details. The FRR is a measure of how often the authentic owner of authentication details is rejected from entering the system. Ideally, one would like a low (approaching 0%) FAR with a similar value for the FRR. Another derived metric, the equal error rate (EER, sometimes referred to as the cross-over error rate (CER)), can be applied, which quantifies the point at which the FAR and FRR intersect each other when plotted on the same graph (see Figure 1).
The product Biopassword is based on US Patent 4,805,222, which discloses verifying an individual's identity based on his keystroke dynamics, principally key pressure and time periods between keystrokes (termed digraphs). A representation of the keystroke dynamics of an individual is stored, and compared with the representation of the dynamics of a person seeking access.
The product Psylock is based on European Patents EP 0917678 and EP 1026570, which disclose comparison of a reference vector representing keyboard dynamics with a vector generated when a user seeks access. This requires an enrollment process requiring the entry of a lengthy string of characters (at least 400, termed a passphrase). The comparison is based on an algorithm which is independent of the sequence in which characters are entered.
Improvements are desirable in the accuracy attained by the above products.
US Patent No. 6,442,692 discloses an authentication system having a microcontroller embedded in a keyboard and which assesses keyboard dynamics, which are independent of typing text.
U.S. Patent No. 6,901,145 discloses a keystroke feature authentication system, wherein a share table contains a row for each keystroke feature. The parameter is compared with a threshold and the results are stored. A history of legitimate authentications is kept. If a keystroke feature consistently produces the same comparison result, then it is considered a distinguishing feature.
Bergadano et al., "User Authentication through Keystroke Dynamics", ACM Transactions on Information and System Security, Vol. 5, No. 4, November 2002, pages 367-397, discloses an edit distance measure to determine how closely a given input sequence compares to a reference string that is obtained during an enrollment process. The edit distance algorithm calculates the number of changes required to order the trigraphs in the input string with respect to a reference string generated during enrollment. A threshold value is generated which serves to automatically determine whether a user input is close enough to the reference vector for acceptance. The following papers are cited: Kenneth Revett, Sergio Tenreiro de Magalhaes, Henrique M. D. Santos: Enhancing Login Security Through the Use of Keystroke Input Dynamics, ICB 2006: 661-667, January 2006; Revett, K., Tenreiro de Magalhaes, S. & Santos, H.: Data Mining a Keystroke Dynamics Biometrics Database Using Rough Sets, 12th Portuguese Conference on Artificial Intelligence (EPIA'05), Workshop on Extraction of Knowledge from Databases and Warehouses (EKDB&W 2005), Covilha, Portugal, 5-8 December, 2005, pp. 188-191; Sergio Tenreiro de Magalhaes, Kenneth Revett, and Henrique M. D. Santos, Password Secured Sites - Stepping Forward With Keystroke Dynamics, International Conference on Next Generation Web Services Practices (NWeSP'05), Seoul, Korea, 22-26 August, 2005; Revett, K. and Khan, A., 2005, Enhancing login security using keystroke hardening and keyboard gridding, Proceedings of the IADIS MCCSIS 2005, pp 471-475, April 19-23, 2005; Kenneth Revett, Sergio Tenreiro de Magalhaes, and Henrique M. D. Santos, Critical Aspects in Graphic Authentication Keys, International Conference on i-Warfare, University of Maryland Eastern Shore, 15-16 March, 2006, pp 212-217.
In general, these papers disclose an authentication system wherein a profile or reference vector summarising keystroke dynamics is generated during an enrolment process (whether via the keyboard or through a graphical mouse driven interface), for use in a verification procedure for a subsequent login ID/Password entry. The verification is based on statistical measures, and a primary parameter generated is coefficient of variation of latency times between digraphs. Evolving typing styles of users over time is taken into account.
The use of machine learning techniques and neural networks in assessing keyboard dynamics has been proposed: see for example: Revett K., et al., Developing a Keystroke Dynamics Based Agent Using Rough Sets, The 2005 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology Workshop on Rough Sets and Soft Computing in Intelligent Agents and Web Technology, Compiegne, France, 19-22 September, 2005; Obadiat, M.S. and Sadoun, B.: A Simulation Evaluation Study of Neural Network Techniques to Computer User Identification, Information Sciences, Vol. 102, pp 239-258, 1997; Paula M.V.S., Kinto, E.A., Hernandez, E.D.M., & Carvalho, T.C.M., User Identification based on Human Typing Pattern with Artificial Neural Networks and Support Vector Machines, 25th Congress of the Brazilian Computer Society, pp 484-493. Haider et al., "A Multi-Technique Approach for User Identification through Keystroke Dynamics", IEEE International Conference on Systems, Man and Cybernetics, 2000, Vol. 2, pages 1336-1341, discloses the use of a suite of techniques for password authentication using neural networks, fuzzy logic, and statistical methods. Typing speed, as measured by digraphs between adjacent characters of the password, is used as the biometric. False Acceptance Rates and False Rejection Rates are measured for the various techniques.
Summary of the Invention
It is an object of the invention to provide a method and apparatus for a biometric security system for authenticating the user of a key device for entering characters into a data processing device, with a high degree of authentication accuracy. It is a further object of the invention to provide a method and apparatus for a biometric security system for authenticating the user of a key device for entering characters into a data processing device, which has a speedy login or enrolment procedure. The present invention is comprised in a biometric security system that has two main operating processes: a first enrolment process in which user data is acquired from features of a user's operation of a keyboard, and a second login process where the user is authenticated by a comparison process of the user data acquired from enrolment with features of the user's login attempt. In the present invention, a convenient measurement value that may be adopted for assessing characteristic features of a user's keyboard dynamics is digraph or trigraph latencies, by which is meant a time interval representative of the sequential pressing of two or three keys respectively. Other values that may be employed are the dwell time for which a key remains depressed, scan codes (ASCII codes) which are assigned to all characters and symbols on each key of the keyboard, pressure upon a key when depressed, and other measures that are explained below or that will be apparent to the person skilled in the art. In the present invention, statistical estimates are generated from recorded measurement values, and these may be employed for a base level assessment of a login attempt, which is a highly secure assessment.
In addition, estimates may be generated based on an entropy function, representing the degree of disorder or variability in a user's typing style. There are various functions that can provide such a measure, and we may employ a measure from information theory that represents the information capacity within an action, and is known as Shannon's entropy: this is explained in more detail below. Other measures may be employed, for example based on the concept of an edit distance, as explained below. The entropy value may be employed directly or as an adjunctive measure to the statistical estimates, and gives additional flexibility in the assessment process of a login attempt. In accordance with the invention, one or more machine learning algorithmic techniques may be employed to assess the data. Machine learning is a broad area of artificial intelligence concerned with the development of techniques that allow computers to "learn" how to perform specific tasks, such as classification in this case. Thus the invention may comprise a hierarchical authentication process that works at a plurality of levels. An issue with assessing a login attempt using only statistical measures is that whilst it gives a highly secure authentication, it may result in rejection of an authentic user, because one or more unexpected significant variations in his login attempt may cause threshold values to be exceeded. Even an assessment based on entropy functions may not be able to deal with this issue. The value of a machine learning technique for assessment is that it can deal with major variations from the expected characteristics of an authentic user's login attempt, without causing rejection, because the assessment is based more on a collection of individual features that together characterise the user, rather than measures based on an overall assessment of the user's technique.
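Shannon's entropy referred to above can be computed over the symbol frequencies of a (discretised) login entry. This is a generic illustration of the standard formula, not a verbatim transcription of the patent's equation 3:

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """H = -sum(p_i * log2(p_i)) over the relative frequencies p_i of
    the symbols; low H means a consistent (ordered) typing pattern,
    high H a disordered one."""
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in Counter(sequence).values())

print(shannon_entropy("ACAC"))  # 1.0: two symbols, evenly mixed
print(shannon_entropy("ACEG"))  # 2.0: four distinct symbols, maximal disorder
```

A login attempt whose entropy differs markedly from the enrolment entropy can then be flagged as disordered relative to the user's established style.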
Thus if one characteristic feature is not present, because of an unexpected variation occurring in a login attempt, nevertheless the login attempt may be accepted if other characteristic features are present. Such a technique therefore reduces the risk of false rejection, and hence lowers the FRR. There may however be an increased risk of false acceptance, which can be ameliorated by combining various techniques which we describe more fully in this document. Neural networks may desirably be employed in the machine learning assessment. Desirably in accordance with the invention, a Learning Vector Quantisation (LVQ) process is employed where respective clusters of failed and accepted login attempts are collected, and for a new login attempt, the Euclidean distances are computed between the two clusters. The cluster that the login attempt is closest to is deemed the winning cluster, and the associated label of that cluster forms the decision class output. This technique is computationally inexpensive, requires only a quick training period, and is highly accurate in most applications. A disadvantage of LVQ is that it cannot be deployed immediately after enrolment of a user, because it requires a collection of invalid login attempts, in addition to the original enrolment process, so that the algorithm may be "trained" to categorise attempts as either failed or accepted. Such an algorithm, requiring a training process, is termed a supervised algorithm. In accordance with the invention, a further machine learning algorithm is employed that is unsupervised, and does not require training on examples of failed login attempts. Whereas algorithms such as clustering algorithms or self-organised maps may be employed, desirably, in accordance with the invention, an algorithm may be employed from the area known as bioinformatics.
Specifically, an Artificial Immune System (AIS) algorithm may be employed, as described in more detail below, that does not require examples of failed login attempts. AIS computer algorithms are inspired by the principles and processes of the vertebrate immune system. The algorithms typically exploit the immune system's characteristics of learning and memory to solve a problem.
An AIS algorithm may accept as input entropy values determined from the log in attempt. Alternatively, and as described in more detail below, the AIS algorithm may accept as inputs data representing bioinformatics motif patterns, representative of features of the user's keyboard style. In addition, a bioinformatics process may be employed to generate entropy values.
A Power Law of Practise (PLP) step may be deployed, to take account of the evolution of typing style with repeated logins. PLP reflects the reduction in performance time resulting from practise. In our implementation, we plot the log of the trial number versus the log of the performance time. The slope of this function provides us with a quantitative measure of the performance enhancement produced by practise. This measure is used for adjusting thresholds in the assessments of a login attempt, and may be employed as a separate assessment step. Thus in accordance with the invention, a multi-tiered process may be employed to assess a login attempt. Initially an AIS process is deployed to check for characteristic features of a user's keyboard style. An LVQ process may be deployed after repeated login attempts to give greater security. If a greater level of security is required, an assessment based on entropy values may be employed, conveniently based on calculation of edit distances. Finally, for an even higher level of security, an assessment is made based on absolute statistical measures. Thus a very flexible system is created which may give initially a very low FRR, and with repeated deployments of higher level processes may give greater security whilst controlling FAR and FRR.
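The PLP measurement described above amounts to the slope of a log-log regression of performance time against trial number. A minimal sketch follows; the synthetic login times are constructed to follow a power law exactly, purely for illustration.

```python
import math

def plp_slope(trial_numbers, times):
    """Least-squares slope of log(time) versus log(trial number).
    The Power Law of Practise predicts a straight line whose negative
    slope quantifies the speed-up gained with repetition."""
    xs = [math.log(t) for t in trial_numbers]
    ys = [math.log(v) for v in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Hypothetical login times (seconds) following t = 2.0 * trial**-0.3:
trials = [1, 2, 3, 4, 5]
times = [2.0 * t ** -0.3 for t in trials]
print(round(plp_slope(trials, times), 3))  # → -0.3
```

The recovered slope (-0.3 here) could then be used, as the text suggests, to shift the acceptance thresholds as the user's typing speeds up with practise.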
Hence in a first aspect, the invention provides a method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string during the login procedure, the method comprising: an enrolment process wherein a set of data is determined representing characteristics of key dynamics of the user, said characteristics comprising at least one of digraphs, trigraphs, dwell times, and scan codes; and assessing the key dynamics of a user's login attempt by comparison with said set of data, comprising providing at least the following processes: a) a process comprising a statistical estimate of a user's login attempt; b) a process comprising an entropy estimate based on the degree of disorder in the login attempt; and c) a process comprising at least one machine learning technique where an assessment is made as to whether characteristic features of a collection of characteristic features are present in the login attempt; and depending on the level of security required, deploying any or all of processes a), b) and c) for assessing a login attempt.
In a further aspect, the invention provides a method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the method comprising: an enrolment process wherein user data is determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user, wherein in an initial step of the enrolment process, selected values of the user's keyboard entry are assigned to intervals in a range of values, and each interval is assigned a denoting symbol, and wherein repetitive motif patterns are determined in said denoting symbols by a multiple sequence alignment process; and assessing the key dynamics of a user's login attempt by comparing characteristics of the login attempt with said repetitive motif patterns.
In a further aspect, the invention provides a method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the method comprising: an enrolment process wherein user data is determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user; providing an Artificial Immune System process, and generating strings of data items, each string representing a sequence of characteristic features found in the dynamics of an enrolment login attempt, each string being denoted as a collection of one or more MHC molecule(s), and generating at least one antibody from a sequence of at least one MHC molecule, said antibody comprising a string of numerical values related to the numerical values of the or each MHC molecule; and assessing the key dynamics of a user's login attempt by locating MHC molecules therewithin, and applying said antibody to the located MHC molecules to assess a level of correlation therebetween.
In a further aspect, the invention provides apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the apparatus comprising: an enrolment means storing a set of data representing characteristics of key dynamics of the user, said characteristics comprising at least one of digraphs, trigraphs, and dwell times; means for assessing the key dynamics of a user's login attempt by comparison with said set of data, comprising providing at least the following assessment means: a) an assessment means comprising a statistical means for estimating a user's login attempt; b) an assessment means comprising an entropy means for estimating a login attempt based on the degree of disorder in the login attempt; and c) at least one machine learning means for assessing whether characteristic features of a collection of characteristic features are present in the login attempt; and means for deploying any or all of means a), b) and c) for assessing a login attempt, depending on the level of security required.
In a further aspect, the invention provides apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a character string as a login procedure, the apparatus comprising: an enrolment means storing a set of data representing characteristics of key dynamics of the user, said user data including selected values of the user's keyboard entry assigned to intervals in a range of values, wherein each interval is assigned a denoting symbol, and repetitive motif patterns determined in said denoting symbols by a multiple sequence alignment process; and means for assessing the key dynamics of a user's login attempt by comparing characteristics of the login attempt with said repetitive motif patterns. In a further aspect, the invention provides apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the apparatus comprising: an enrolment means storing user data determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user; an Artificial Immune System means for generating strings of data items, each string representing a sequence of characteristic features found in the dynamics of an enrolment login attempt, each string being denoted as an MHC molecule; means for generating at least one antibody from a sequence of at least one MHC molecule extending over at least part of an enrolment login attempt, said antibody comprising a string of numerical values related to the numerical values of the or each MHC molecule; and means for assessing the key dynamics of a user's login attempt by locating MHC molecules therewithin, and applying said antibody to the located MHC molecules to assess a level of correlation therebetween.
In at least preferred forms of the invention, features of the invention are as follows:
• employs a hierarchical authentication scheme that works at several complementary levels
• deployment of a neural network when a collection of non-authenticated samples (either imposter or failed authentic attempts) and authentic login attempts have been generated - trained using a random sample of known authentic login attempts and all available invalid login attempts
• is able to work as a classifier in an on-line fashion
• employs an artificial immune system algorithm for part of the authentication algorithm in an on-line manner without the need for imposter samples
• employs a bioinformatics based edit string and multiple sequence alignment algorithm for motif extraction, used in the artificial immune system and entropy determination, which are used in the authentication process
• enhances the individuality of a user's details during enrolment by ensuring that the first input trial has sufficient variation, and then requiring sufficient consistency from the 3rd entry onwards for the remainder of the enrolment process
• entropy of the login attempt is calculated and is used in the authentication process
• requires that the login ID/password selected by the user contains a sufficient level of uniqueness with respect to digraphs - the system requires that the user selects details that contain at least 50% unique digraphs
• The system maintains an input history for each login ID/password combination that is updated after each successful login attempt. All required measurements such as variances, CV, and entropy etc. are updated to reflect the current input typing style of the authentic owner.
• employs the Power Law of Practice as part of the authentication process by measuring how typing speed for the login ID and password (treated separately) has changed with repeated use of the system
• uses absolute values of digraphs/trigraphs/durations as part of the authentication process, via a statistical measure directly related to the coefficient of variation (CV) of these same measures acquired during the enrolment process
Brief Description of the Drawings
A preferred embodiment of the invention will now be described with reference to the accompanying drawings, wherein:
Figure 1 is a diagrammatic representation of FAR, FRR and EER;
Figure 2 is a schematic representation of digraph and trigraph latencies;
Figure 3 is a schematic view of a keyboard partitioned for maximal dispersion, shown as an example divided into four regions as indicated by the shades of grey;
Figure 4 depicts diagrammatically an antibody binding with an antigen (MHC molecules in this algorithm). The MHC corresponds to regions of the input string that are similar across all or the majority of the enrolment samples. The antibodies are tuned to match the resulting MHC molecule(s) that are specific to a given set of MHC molecule(s) per enrolment string;
Figure 5 depicts the concept of the edit distance as employed in this invention. This particular example yields an edit distance of 6 - which represents the number of changes required to perfectly align the two patterns;
Figure 6 depicts a table of digraphs collected during a 10 entry enrolment transformed into the amino acid alphabet for a given login ID: qxtjfa and password: atbeus. Each column represents the same digraph and each row is a complete entry;
Figure 7 depicts the result of the multiple sequence alignment for a given login ID/password sequence, for login ID: qxtjfa and password: atbeus. The '-' symbols indicate no-match situations. In this particular example, the first string yields 3 separate MHC molecules and the second 2 unique MHC molecules; and
Figure 8 is a flow chart of deployment of a series of algorithms in accordance with the invention. A 'No' result indicates a possible point of authentication failure.
Description of the Preferred Embodiment
The present invention, at least in its preferred embodiment, has as its primary aim the task of enhancing C2 based security through various machine learning techniques and/or statistical methods. The primary task is to verify that the possessor of a login ID/password is the rightful owner of said details. This is accomplished through a series of verification procedures which are enabled at the login site. A login site could be a PC remotely accessing a web based service, a PC in a networked office/business environment, any keypad based installation, mobile devices and personal desktop assistants. The only constraint with regards to the application is that the device should have sufficient memory to store, and a processor to execute, the programme.

The present invention provides a multi-layered authentication process. A combination of novel machine learning approaches and a multi-layered authentication approach ensures a high level of security in a software-only biometrics system. The preferred embodiment of the invention extracts information from users when they enrol into the system. The login ID and password have a minimal length of 6 characters and symbols and no upper bound. This system could also be employed where a passphrase is used. A passphrase is simply a collection of characters and symbols that form a particular phrase that can be used to validate access to a trusted computer system (such as a C2 based security system). Provided there were sufficient numbers of characters in the phrase to generate repeated digraphs/trigraphs, a passphrase would be adequate for enrolment. The default version employs login ID/password for enrolment. Each user is required to enter a login ID/password combination during enrolment (a default value of 10 times) until a combination of the following criteria have been met: i) an entropy threshold and ii) a consistency threshold.
By an entropy threshold, we mean that the variability between keystroke entries (as measured by variance in digraph/trigraph times and duration for all keystrokes used in forming the login ID/password) contains sufficient variability (i.e. is above a threshold value). This measure is captured through the CV for the digraphs/trigraphs that were extracted from the particular characters forming the user's login ID/password. We do not wish users to monotonically enter their login ID/password as this may reduce the protection afforded by our system by enabling hackers to simulate the user input provided. A default level of variability is employed - 0.20 (as measured by the coefficient of variation, a direct measure of variance normalised by the mean value) - and is adjustable in that we may raise/lower this threshold as the circumstances dictate. In addition, a consistency threshold is used to ensure that the entries of the login ID/password (with respect to digraph/trigraph/duration times) are consistent during the enrolment process. This is a tunable parameter and the initial value is set to 0.50 (again measured by the CV - but this time across the same digraph/trigraph recorded during the enrolment process, where the user is requested to enter their login ID/password on average 10 times). Starting with the 3rd input, the consistency is calculated by measuring the CV between successive entries and a running average is maintained. If the CV is too large, then that entry is rejected and the user is respectfully requested to repeat that entry. Again, this threshold is a tunable parameter; the lower the value, the more robust this feature becomes. In addition, ensuring that the login ID/password details contain a minimal level of uniqueness helps ensure that the login details are difficult to re-create by unauthorised users.
In very strict security installations, this value can be reduced to ensure that the login data during enrolment is as consistent and reproducible as possible - to an almost arbitrary level of stringency. If a user is not able to enter their login ID/password details with sufficient consistency, then they will be requested to re-enrol with a different login ID/password. If this situation persists, then the enrolment process will go to completion and the account details will be flagged as a potential risk to the system administrator, if this is at all possible. Based on the dynamics of how the user enters their login details, a measure of their natural variation is captured through these statistical measures.
Thus, the calculated entropy of this system is based on the following: for the initial enrolment entry (the first entry), the CV is calculated for each of the digraphs/trigraphs. The CV is the standard deviation normalised by the mean. The standard deviation is calculated based on the digraphs/trigraphs - each done separately. For example, using digraphs only, we calculate the sum of all digraphs for the login ID and password separately. We derive the mean, variance, and standard deviation. Then we derive the CV by dividing the standard deviation by the mean. This is repeated for login ID and password separately, and for digraphs and trigraphs separately. If, for the login ID and password, the digraph/trigraph CV values are less than a given threshold, the entry is rejected. The user will be notified if there was a breach of the entropy measure and, if so, whether it was for the login ID, password, or both. The threshold is based on a constant factor times the number of characters in the login ID/password - since we have no data at the first entry that can be used to make a personal determination. The use of entropy in the enrolment process is an inherent measure of stability in the enrolment process.
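The CV-based variability check described above can be sketched as follows. This is a minimal illustration only: the function names are our own, and the 0.20 default is the threshold quoted earlier.

```python
from statistics import mean, pstdev

def coefficient_of_variation(times):
    """CV: the standard deviation normalised by the mean."""
    m = mean(times)
    return pstdev(times) / m

def entry_has_sufficient_variability(digraph_times, threshold=0.20):
    """Accept a first enrolment entry only if its digraph timings
    show enough natural variation (CV at or above the threshold)."""
    return coefficient_of_variation(digraph_times) >= threshold
```

A perfectly monotonic entry (all digraphs equal) yields a CV of zero and would be rejected, which is exactly the behaviour the text asks for.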
In addition, if there is not a sufficient number of unique digraphs/trigraphs in the login ID/password, then the login ID/password is rejected. We require by default that at least half of the digraphs/trigraphs be unique. This prevents the user from entering too many repeated sequences (e.g. login ID: qweqwe and password: asdasd). The system requires that the user select a login ID/password that has a variety of characters in it to ensure that the keyboard is well sampled. In the most stringent cases, the system also requires that users select characters that are maximally dispersed across the keyboard - implementing what is termed 'keyboard partitioning'. Here, we partition the keyboard into four sections, and we require that users select characters in their login ID/password that are not in the same partition consecutively (please see Figure 3). This feature is implemented in high level security installations only. In addition, we can maintain a dictionary attack database and check to see if the selected user ID/password is contained in the database. If the login ID/password are contained within the database, the entry is rejected. We maintain an active dictionary attack database so the information contained is current. The use of this facility is for strictly high level security applications - as it requires that the user selects IDs that are more difficult to enter. After the 2nd enrolment trial, we then calculate the consistency of subsequent entries. To implement this feature, we calculate the CV across each digraph/trigraph value. For each digraph/trigraph, we calculate the CV and if this is above a given threshold, we reject that entry and respectfully ask the user to repeat it, indicating that it varied too much from previous entries. This ensures that we have sufficient inherent variation within a given entry, and low dispersion between entries.
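The uniqueness and keyboard-partitioning rules might be sketched like this. Two points are assumptions, not fixed by the text: we read "50% unique digraphs" as "at least half of the digraph occurrences appear exactly once" (which correctly rejects qweqwe/asdasd), and the four keyboard regions are a hypothetical layout standing in for the partitioning of Figure 3.

```python
from collections import Counter

def digraphs(s):
    """Adjacent character pairs of a login string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def unique_digraph_fraction(s):
    """Fraction of digraph occurrences that appear exactly once
    (an assumed reading of the 50%-unique rule)."""
    counts = Counter(digraphs(s))
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(digraphs(s))

# Hypothetical four-region split of the keyboard; the real regions of
# Figure 3 are not reproduced here.
PARTITIONS = [set("12345qwert"), set("67890yuiop"),
              set("asdfgzxcvb"), set("hjklnm")]

def region(ch):
    for i, keys in enumerate(PARTITIONS):
        if ch in keys:
            return i
    return -1

def maximally_dispersed(s):
    """True if no two consecutive characters share a partition."""
    regions = [region(c) for c in s]
    return all(a != b for a, b in zip(regions, regions[1:]))
```

Under this reading, "qweqwe" scores only 1 of 5 unrepeated digraphs and is rejected, while the example ID "qxtjfa" scores 5 of 5.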
After the user has successfully enrolled into the system (which could require 10 or more entries - again this is a system defined parameter - 10 is the default), a reference template is generated for each user. This template consists of the following elements:
1- keypress duration (dwell time) in milliseconds (accurate to within 1 msec) for all keystrokes in the login ID/password
2- all digraphs/trigraphs time for all keystrokes entered for the login ID/password (with 1 msec resolution)
3- scancode for each keystroke entered
4- total length of time for login ID
5- total length of time for password
6- total length of time for both login ID/password - which includes any gap required between login ID/password
7- entropy of the login ID/password
8- the minimum and maximum digraph/trigraph times for both the login ID and password
9- whether any shift keys have been pressed and if so which one (right or left)
10- overall typing speed for login ID/password
11- if multiple digraphs/trigraphs have been entered - the average value of these is calculated
12- learning curve factor (from the Power Law of Practice)
These attributes form a reference vector that is stored for each user. In addition, each time a user successfully logs into the system, the reference vector is updated with the new details - thereby allowing the reference vector to adjust to changes in the user's typing style. The adjustment is performed as a weighted average, which depends on how far away the current login details are from the running average. This allows the adjustment to more accurately estimate the variation/change in typing style of the user. From these attributes, we generate a set of algorithms to authenticate a login ID/password on a 1:1 basis - that is, the current input details are compared with the stored reference vector for the given login ID/password.
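The distance-dependent weighted update can be sketched as below. The specification fixes only the behaviour (samples further from the running average should shift the template less), so the particular weighting formula and the base weight are assumed illustrations.

```python
def update_reference(ref, new, base_weight=0.2):
    """Blend a successful login's timing values into the stored
    reference vector. The blend weight shrinks as the new sample moves
    further from the running average, so outliers move the template
    less. (The exact weighting formula is an assumed illustration.)"""
    updated = []
    for r, n in zip(ref, new):
        distance = abs(n - r) / r if r else 0.0
        w = base_weight / (1.0 + distance)
        updated.append((1 - w) * r + w * n)
    return updated
```

An update with an identical sample leaves the reference unchanged; a divergent sample moves it only part of the way toward the new value.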
We employ a hierarchical authentication scheme where each of the separate authentication mechanisms builds on previous levels to form a complementary system. Starting from the lowest level of authentication, we use the absolute timings of the keyboard input. We look at the absolute duration of all keypresses, and the time between successive digraphs/trigraphs, as is illustrated in Figure 2.
As shown in Figure 2, there are various time intervals that may characterise a digraph. Figure 2 indicates depression of a key "N" at time T1, release of the key at time T3, depression of a key "O" at time T2, and release of that key at time T4. Any or all of the time intervals between these events may be used to represent the digraph, whereas the intervals T1-T3 and T2-T4 represent the durations the keys remain depressed. A fundamental authentication algorithm employed in our hierarchical authentication algorithm employs a statistical measure of the distance between a given login ID/password entry and those stored as a reference value for that login ID/password - generated through the enrolment process. If they differ by a given threshold value, for all digraphs/trigraphs, then the input is considered invalid. The equation used for this authentication is provided in eqn 1 below. We have found that this type of authentication can produce an equal error rate (EER - the point where FAR and FRR cross when plotted together) of approximately 5-7%. One can reduce the FAR by reducing the threshold value - but this generally tends to increase the FRR to unacceptable levels. In accordance with the invention, we have employed additional algorithms to try to strike a balance between FAR/FRR - that is, reducing the EER to as close to 0% as possible. We employ a series of algorithms that build on lower level algorithms in order to achieve a better EER value. We collect data that is directly related to how the user types - that is, we extract the dynamics of how the user enters their login details - including digraphs/trigraphs/dwell times/scan codes etc. A scan code is the numeric value assigned to every key on a standard PC keyboard, which includes not only alphanumeric characters, but also punctuation marks and all other symbols. Other devices will have a similar labelling system that we would employ when applicable.
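Extracting dwell times and digraph latencies from the raw key events of Figure 2 might look as follows; the event tuple layout is an assumption, and press-to-press latency is just one of the interval choices the figure permits.

```python
def dwell_times(events):
    """Dwell time = release minus press, per key.
    Each event is an assumed (key, press_time, release_time) tuple."""
    return [release - press for _key, press, release in events]

def digraph_latencies(events):
    """Press-to-press latency between successive keys - one of the
    interval choices illustrated in Figure 2."""
    return [events[i + 1][1] - events[i][1] for i in range(len(events) - 1)]
```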
We then derive higher order data from these fundamental inputs to derive statistical measures, entropy, and the like. These values are available for use in the higher order authentication algorithms such as the artificial immune system and the learning vector quantisation neural network. When we actually go through the authentication procedure, we start from the higher level algorithms and continue through the hierarchy until we reach the lowest level (i.e. the actual digraph/trigraph measurements). The reason is that the higher order algorithms such as a neural network or AIS algorithms are more robust to noise in the input data. On the other hand, low level algorithms such as using a margin for digraph/trigraph/duration values tend to generate a very accurate authentication algorithm with a low FAR, but with a higher FRR than the more robust algorithms. By combining both classes of algorithms into our system, the results indicate that we obtain the best of both: extremely low FAR with a concomitant low FRR. The lowest level algorithm in our hierarchy employs a statistical classification mechanism. The algorithm works by determining the CV for all digraphs/trigraphs/durations and related measures in the reference vector. A critical element in this process is to gather the average values for all of the elements contained in the reference vector. When a user logs into the system, the values for all the elements contained within the reference vector for this login ID/password are determined and compared with those of the reference vector according to equation 1 below:
N*(λ - λ*CV) < X < N*(λ + λ*CV)    (eqn 1)
where N is a multiplicative factor, set to 1.0 for default stringency, λ is the value for a particular digraph/trigraph (stored in the reference vector) and CV is the reference value for that digraph/trigraph generated during enrolment. X refers to the value we are measuring, generated from a given login attempt.
If the particular feature input from a given login is contained within the margin specified in equation 1, for all features (i.e. all digraphs/trigraphs/durations etc.), then the login attempt is assumed to be valid. All login details for this login ID are updated to reflect the latest successful login attempt and all related statistics are re-computed and stored for immediate use. A stringency factor (N in eqn 1) is directly related to these measures and is embodied as a multiplicative term. The stringency measure is employed during authentication at the lowest level of the hierarchy. This layer in the hierarchy can be thought of as the lowest level of authentication - used when all other measures provide inconclusive data regarding the authenticity of the current login attempt. The stringency threshold, as well as all others employed in this system, are selected based on the values for these statistical measures captured during enrolment, and are updated during usage. For variations in the stringency level, a multiplicative term is applied to these statistics to derive the final thresholds for acceptance. For variations in stringency, values for the multiplicative term (N in eqn 1) are derived from the range of variation generated during the enrolment process. The final result is an average value for the statistical measures, modulated by the range of values generated during enrolment, and hence tailored to each particular user of the system. Also note that as the system maintains historical data, these measurements are updated over time and reflect the most current status of the user with respect to their input style. This allows a more realistic measure of 'high' stringency. Instead of generating a single measure of 'high' or 'low' stringency and applying it to all users, we derive a customised value of stringency that is based on each individual user, during the enrolment process.
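The eqn 1 acceptance test, applied across all features, can be sketched directly. The function names and the (λ, CV)-pair representation of the reference vector are our own; the inequality itself is eqn 1 as stated.

```python
def within_margin(x, lam, cv, n=1.0):
    """Eqn 1: accept X if N*(λ - λ*CV) < X < N*(λ + λ*CV)."""
    return n * (lam - lam * cv) < x < n * (lam + lam * cv)

def login_valid(attempt, reference, n=1.0):
    """attempt: measured values per feature; reference: (λ, CV) pairs.
    Every feature must fall inside its margin for the attempt to pass
    this lowest-level check."""
    return all(within_margin(x, lam, cv, n)
               for x, (lam, cv) in zip(attempt, reference))
```

Lowering N narrows every margin simultaneously, which is how the single stringency term reduces the FAR at the cost of the FRR.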
We generate a measure related to the entropy of the input sequence with respect to the reference vectors, termed the edit distance. With this algorithm, we sort the reference vector attributes (digraphs, then trigraphs and durations only) according to time in ascending order. We do the same for the same attributes extracted from the input sequence. We align the 2 sequences (reference and input) so they match up and record the number of moves required to make the 2 sequences match up (see Figure 5). This is called the edit distance. As shown in Figure 5, an example of the edit distance algorithm is taken from a 6 character login ID. The arrows indicate where the 2 elements in the string match, and the number of displacements required to match all the symbols in the 2 strings represents the edit or Levenshtein distance. In this hypothetical example, the edit distance is 6 - which represents the number of changes required to perfectly align the two patterns. Note that this value is divided by the total number of shuffles (16) to generate the actual score (0.375).
In order to convert this metric into a form of entropy, we divide the number of moves actually made by the number possible for a string of a given length (given in eqn 2 below). We use a threshold based on the entropy measure derived during the enrolment process to decide if the entropy is within acceptable limits.

N²/2 if the number of digraphs/trigraphs is even, and (N² - 1)/2 if it is odd    (eqn 2)

where N = number of characters in the sequence. We apply this edit distance measure as an adjunctive element in the authentication procedure - it is additional information that is used to help decide in cases where higher order algorithms have yielded marginally clear results.
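The move count and its normalisation might be sketched as below. Reading the "number of moves" as the minimum number of adjacent swaps (an inversion count) is our assumption; the denominator follows eqn 2 as stated rather than the combinatorial swap maximum.

```python
def move_count(reference_order, input_order):
    """Minimum adjacent swaps (inversion count) needed to turn
    input_order into reference_order - one reading of the 'number of
    moves' in the edit distance step. Both arguments are the same set
    of attribute labels, sorted by time in each sequence."""
    pos = {label: i for i, label in enumerate(reference_order)}
    perm = [pos[label] for label in input_order]
    moves = 0
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if perm[i] > perm[j]:
                moves += 1
    return moves

def normalised_edit_score(reference_order, input_order):
    """Divide the moves made by the maximum given in eqn 2."""
    n = len(reference_order)
    max_moves = n * n // 2 if n % 2 == 0 else (n * n - 1) // 2
    return move_count(reference_order, input_order) / max_moves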
The next level of our hierarchical approach employs algorithms of the type employed in the machine learning field known as bioinformatics, or biologically-inspired computing. The next stage entails converting our array of floating point numbers resulting from measurements of digraph/trigraph/duration entries into a discrete alphabet. We select a discrete alphabet, such as the amino acid sequence in bioinformatics, which has a 20-character alphabet representing the 20 amino acids commonly found in living organisms. During the enrolment process, we find the maximum digraph and trigraph values (we do not use duration for this calculation). We then divide the digraphs/trigraphs into discrete intervals, where each interval is the maximal value (assuming 0 as the absolute minimum value) divided by 20, the number of amino acids. We then assign each digraph/trigraph value the amino acid letter that corresponds to the appropriate interval. For instance, if the maximum value of the digraphs for enrolment was 0.70, then we would have 20 intervals of 0.035 each (corresponding to a range of 35 msec for that particular attribute). The amino acid sequence is sorted in ascending order, yielding the following sequence:
AA = "ACDEFGHIKLMNPQRSTVWY"
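This discretisation step can be sketched as follows. Clamping times beyond the enrolment maximum into the last interval is an assumed edge-case choice, since the text does not say what happens to a later value exceeding the maximum.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20-symbol amino acid alphabet above

def to_symbols(times, max_time):
    """Map each digraph/trigraph time onto one of 20 equal intervals
    spanning [0, max_time], where max_time is the maximum value found
    during enrolment; out-of-range times are clamped (assumption)."""
    width = max_time / len(AA)
    return "".join(AA[min(int(t / width), len(AA) - 1)] for t in times)
```

With the worked figures from the text (max 0.70, intervals of 0.035), a 0.02 s digraph maps to "A" and a 0.06 s digraph to "C".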
A digraph value of 0.000 - 0.035 would be given the symbol "A" and 0.060 would be given the symbol "C", etc. In this way, we have converted a collection of real numbers with an infinite set of values into a discrete set. Factors that affect the granularity of these intervals are the size of the alphabet and the maximum attribute measure. From our experience, the maximal value is usually less than 1.5 seconds - and more typically has a value under 1 second - but this is user dependent. In cases where we find it necessary, we can switch to an alphabet with a larger or smaller cardinality - such as the English alphabet with 26 elements or the genetic code that contains 4 elements. The lower the maximal value, the more consistently someone types, and hence the finer the granularity, and vice versa. This allows the system to scale in accordance with the user's typing speed - which, although it may vary with time and practice, is continuously updated. Once we have a collection of converted digraphs/trigraphs in the form of 10 strings (equal to the number of enrolment trials) of amino acid sequences, which are minimally 9 characters long (for digraphs, with a minimal login ID/password length of 6 characters), we are then able to apply selected bioinformatics based algorithms to assist us in characterising the reference vector, which can ultimately be used to aid in the authentication process. It is then required to generate entropy values for the enrolment results. Whilst this could be done as described above using CV values, a preferred method is as follows. The first stage is the generation of Shannon's entropy formulation:
H(x) = -Σ_{i=1}^{n} p_i log2 p_i    (eqn 3)
where p_i is the probability of symbol i appearing. That is, the entropy of the event x is the sum, over all possible outcomes i of x, of the product of the probability of outcome i times the log of the inverse of the probability of i. This formulation accounts for symbols that appear with varying and unknown probabilities. We use the entries in the enrolment process to capture the expected probabilities (see Figure 6 for an example of enrolment entries). Once we have the distributions for each symbol (which corresponds to each digraph/trigraph time), we can then use the formula in eqn 3 to calculate the entropy of the login ID/password sequence. More specifically, we calculate the total entropy of the enrolment entries and generate an average value of the entropy from these values. We use this average entropy value for the enrolment entries as a threshold for subsequent login attempts. For a subsequent login attempt, we can then calculate an entropy measure using the probability distributions generated from the enrolment series. If the entropy of the login attempt is approximately equal to the threshold value (determined from an examination of the entropy of the enrolment entries), then the login attempt is considered valid by this measurement. Otherwise this authentication step pronounces the login attempt invalid. This criterion is not solely responsible for determining whether the input is authenticated: we use it in conjunction with other measures to make the final decision. This depends on the stringency level required by the security requirements of the installation.
Thus, once a table of the form of Figure 6 has been created, we can calculate the occurrence frequencies ("substitutions", in bioinformatics terms) for each digraph in the table - which reflect substitutions of one digraph interval with another. This gives the p_i in equation 3, the frequency distribution for each digraph/trigraph. That is, we now know from the data obtained during enrolment what the probability distributions are for each of the digraphs/trigraphs in the login ID/password sequence. Calculating the entropy is then straightforward using equation 3.
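The column-frequency and eqn 3 steps might be sketched as below. The small floor probability for symbols never seen in enrolment is an assumed smoothing choice, since eqn 3 is undefined for a zero probability.

```python
from collections import Counter
from math import log2

def column_distributions(entries):
    """Per-column symbol probabilities from the enrolment table
    (rows of equal length, as in Figure 6)."""
    return [{s: c / len(entries) for s, c in Counter(col).items()}
            for col in zip(*entries)]

def attempt_entropy(entry, distributions, floor=1e-6):
    """Eqn 3 applied to a login attempt, scored with the enrolment
    distributions; unseen symbols get a floor probability (assumption)."""
    total = 0.0
    for symbol, dist in zip(entry, distributions):
        p = dist.get(symbol, floor)
        total -= p * log2(p)
    return total
```

The resulting value is then compared against the average entropy of the enrolment entries, as described above.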
As shown in Figure 6, a collection of digraphs collected during a 10 entry enrolment and transformed into the amino acid alphabet for a given login ID (qxtjfa) and password (atbeus) are arranged in a table or matrix. Each column represents the same digraph in time sequential order and each row is a complete entry.
We also employ a multiple sequence alignment on the enrolment entries of Figure 6. The idea is to find regions of similarity between the input sequence and the enrolment sequences. Similar to the task in bioinformatics, we are looking for regions within the input sequences from the enrolment process that appear to be conserved - in that they occur with a high degree of regularity. Regions of high consistency indicate input attributes that are highly repeatable by this individual, which we may be able to exploit for authentication purposes. We use a dynamic programming algorithm to find motifs that exist within the enrolment sequences. We can then use these motifs - subsequences within the enrolment entries - and see if they occur within the input sequence, using the artificial immune algorithm in our case. If the motifs do exist in the input sequence, then it is highly likely that the input was produced by the authentic user. This level of analysis provides us with information on regional consistencies within the input - which indicate stability in the typing style of the user. The pure digraph approach looks at the input from a very local level - requiring that all digraphs are matched between the login attempt and stored reference values. One could choose to select a subset of the attributes to match with the reference vector. The multiple sequence alignment approach is a more objective and automated basis for deciding which attributes are important in classifying the input, based on the stability of specific attributes or combinations thereof. Our results indicate that this approach has a significant impact on reducing the FRR compared to the exhaustive attribute match approach. The multiple sequence alignment approach looks more globally, searching for regions of similarity, but not necessarily across the entire input range (see Figure 6 for an example of a transformed enrolment entry).
It is therefore more tolerant to noisy inputs, as might happen when a user logs in without being completely conversant with their login ID/password. This approach helps reduce the FRR without compromising the FAR results - when used in conjunction with the next level in the hierarchy.
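A much-simplified sketch of the motif extraction: a full MSA uses dynamic programming over possibly unaligned strings, whereas this version assumes the enrolment entries are already equal-length rows (as in Figure 6) and takes a column-wise consensus, with '-' marking no-match columns as in Figure 7.

```python
from collections import Counter

def consensus(entries, threshold=0.5):
    """Column-wise consensus over aligned enrolment strings: keep a
    column's predominant symbol if it reaches the threshold, else '-'
    (a no-match position, as in Figure 7)."""
    out = []
    for col in zip(*entries):
        symbol, count = Counter(col).most_common(1)[0]
        out.append(symbol if count / len(entries) >= threshold else "-")
    return "".join(out)

def motifs(consensus_string, min_len=2):
    """Maximal runs of matched columns; these serve as the MHC
    molecules, with '-' runs demarcating them."""
    return [run for run in consensus_string.split("-") if len(run) >= min_len]
```

The minimum motif length is an assumed parameter; the patent fixes only that the '-' positions demarcate the motifs.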
To gather additional information about the authenticity of the login, we employ these motifs discovered by the multiple sequence alignment in the next level in the authentication hierarchy, an artificial immune system (AIS). The AIS algorithm employs the biological immune system as a metaphor for various types of machine learning algorithms. In our case, we are using the AIS algorithm for pattern recognition - a classification task. We use a subset of the generalised AIS algorithm in order to keep the computational load at a minimum. We use the motifs discovered in the MSA algorithm to form the basis of self/non-self discrimination - as our immune systems do. We have a mechanism that can automatically discriminate between self and non-self - and we exploit that concept in the form of an AIS algorithm. There are 2 basic elements involved in this process - something called a major histocompatibility complex (MHC) and a collection of antibodies (Ab) that can bind to MHC molecules once they have bound to an antigen (see Figure 4). It will be understood that the antibodies represent numeric values that are the complement of the numeric values for the MHC values, in that the numeric values when added together are equal to 1, on a normalised scale of digraph times.
In the AIS algorithm as we embody it, the input sequence (the login ID/password represented in terms of the amino acid sequence as described above) serves as a potential antigen - an object that the immune system tries to remove if it is found to be foreign. By foreign, we mean it does not contain markings that indicate it belongs to the possessor of the responding immune system. In our algorithm, the potential antigen is the set of motifs discovered by the MSA algorithm. An entity is considered foreign if it does not possess those antigens. The pattern recognition aspect of this algorithm is evident - if the input string (login ID/password) contains sufficient antigenic sites that are recognised by the MHC molecules, then the pattern is recognised as self - that is, it may be an authentic login attempt. Otherwise it is considered foreign and rejected. In order to implement the AIS algorithm as we have embodied it, we need a collection of MHC molecules and a collection of antibodies. The MHC molecules are derived from the multiple sequence alignment algorithm, acting on the enrolment values as indicated in Figure 6. In Figure 7, we present a graphical depiction of what a collection of MHC molecules may look like for a given enrolment. The data in Figure 7 are generated using the multiple sequence alignment, using the data of Figure 6 (the enrolment sequences transformed into the amino acid symbol set) to derive the frequencies of expected occurrence for each symbol within each column position. Then we apply a multiple sequence alignment algorithm to generate the multiple sequence alignments. Then we select those that correspond to the following requirements: if a column is represented by 1 predominant symbol, then that symbol alone is used. If there are 2 predominant symbols, then we form another sequence alignment using the second predominant symbol. By predominant, we mean approximately 50%.
The guiding measure here is how many different symbols there are in a column and their proportion. If the split is approximately 50% - specifically, four or more of one symbol and the rest a second symbol exclusively - then we generate 2 sequence alignments. Each of the alignments can be used to generate MHC molecules, with the mismatch symbols ('-') serving to demarcate the MHC molecules. These MHC molecules then represent identities or near identities in the original attributes - in this case digraph times. This captures the regularities that are inherent within a given enrolment example. We then use these MHC molecules to distinguish authentic samples from foreign (non-authentic) samples. The next stage is the development of antibodies that react with the MHC molecules to complete the algorithm. In our model of the AIS system, we employ antibodies to ensure that all MHC molecules have bound to the antigen. That is, we create a minimal collection of antibodies that will react with all the MHC molecules for each alignment sequence (in Figure 7 that would result in 2 antibodies). By this we mean each antibody will have as many binding sites as there are MHC sites available for a given login sequence. For Figure 7, antibody one has three MHC binding sites and antibody two has 2 MHC binding sites. After the MHC molecules are allowed to combine with the input symbol sequence, we apply the antibodies. If an MHC molecule has found its corresponding antigen within the discretised login ID/password attempt - by which we mean the attempt contains the same set of symbols found in the MHC molecule - then the MHC molecule becomes activated. Antibodies only react with activated MHC molecules. An antibody that becomes activated must bind to all of the MHC molecule binding sites it possesses. This in turn requires that all MHC molecules generated during the MSA find a suitable match in the input string (in symbol sequence form).
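The column-wise selection of predominant symbols and the demarcation of MHC molecules by mismatch symbols can be sketched as follows. This is an illustrative sketch only, assuming a simple majority threshold; the function names and the toy enrolment data are hypothetical and do not reproduce the patented implementation.

```python
from collections import Counter

def consensus_columns(sequences, threshold=0.5):
    """For each column position, keep the predominant symbol if it reaches
    the threshold proportion; otherwise mark a mismatch with '-'."""
    consensus = []
    for col in range(len(sequences[0])):
        counts = Counter(seq[col] for seq in sequences)
        symbol, count = counts.most_common(1)[0]
        consensus.append(symbol if count / len(sequences) >= threshold else '-')
    return ''.join(consensus)

def extract_mhc_molecules(consensus):
    """Runs of conserved symbols, demarcated by '-' gaps, act as MHC molecules."""
    return [run for run in consensus.split('-') if run]

# Toy enrolment set, already discretised into an amino-acid-style alphabet.
enrolment = ["ACDGW", "ACDHY", "ACDKW", "ACDLM"]
consensus = consensus_columns(enrolment)   # column 4 has no majority → '-'
mhcs = extract_mhc_molecules(consensus)
print(consensus, mhcs)  # → ACD-W ['ACD', 'W']
```

The mismatch symbol splits the consensus into two conserved runs, mirroring how the '-' positions in Figure 7 would demarcate separate MHC molecules.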
Only when all MHCs for a given antibody are activated will the antibody be activated. The algorithm determines if there is an activated antibody - if there is, then the input is considered authentic; otherwise it is considered a failed attempt.
This algorithm evolves with the user's input over time through subtle mutations, responding to subtle changes in motifs (i.e. the user's typing style). This necessitates that the MHCs and antibodies are modified to ensure that they cover new motifs as they are produced.
Another consideration is the granularity of the alphabet with respect to the enrolment thresholds. If the thresholds set during enrolment are very low, this places constraints on the granularity of the discretised alphabet. If we have very stringent thresholds on consistency and variability, we must select a discretising alphabet consistent with those thresholds. The nucleotide alphabet, with its 4 elements ACTG, would yield the same symbol for all values within a range of 250 msec (if the maximum value for digraphs etc. was 1.0 second). This would yield a completely identical set of discretised strings and hence would degrade to matching every single digraph. This negates the local feature aspect of this algorithm. Hence the choice of alphabet is in part dictated by the stringency required - and consequently by the thresholds used during the enrolment process.
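The effect of alphabet granularity can be illustrated with a sketch of the discretisation step (equal-width bins over a 1.0 second maximum, as assumed above; the function name and timing values are hypothetical):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20-symbol alphabet

def discretise(times_ms, max_ms=1000, alphabet=AMINO_ACIDS):
    """Map each digraph time onto one of len(alphabet) equal-width bins."""
    width = max_ms / len(alphabet)  # 50 ms per bin for 20 symbols
    out = []
    for t in times_ms:
        idx = min(int(t // width), len(alphabet) - 1)  # clamp the top edge
        out.append(alphabet[idx])
    return ''.join(out)

# With only 4 symbols (e.g. ACGT) each bin spans 250 ms, so typical
# digraph times all collapse into the same symbol:
print(discretise([120, 135, 180, 240], alphabet="ACGT"))  # → AAAA
# The finer 20-symbol alphabet preserves more of the local structure:
print(discretise([120, 135, 180, 240]))                   # → DDEF
```

The 4-symbol case shows the degradation described above: distinct timings become indistinguishable strings, negating the local feature aspect of the algorithm.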
The above approach provides the means to identify a login ID/password entry when only authentic samples are assumed to be available. There are machine learning algorithms that serve as classifiers, such as the many varieties of neural network, along with other classifier based systems. Many of these require examples of all the classes that they are meant to classify - they are trained using supervised training algorithms. In the case presented here, when we have just enrolled a user, there are only examples of valid enrolments - although one can envision mistakes being made during enrolment that might generate false attempts, one cannot guarantee that this will happen. The algorithms described in detail so far assume only that we have a single class of examples - those from the authentic user. Examples of non-authentic login attempts - either from an imposter or from a failed authentic user attempt - accumulate over time. If there is a sufficient number of non-authenticated samples, then we may employ a supervised neural network or related classifier. We selected the learning vector quantisation (LVQ) neural network, as it is very efficient and accurate, with minimal storage space.
The LVQ algorithm is a supervised neural network classifier that requires at least 2 classes of training examples - in our case, one corresponding to authentic users and one corresponding to failed login attempts. To train the system, one must associate a decision label with every training example - either authentic (labelled 1) or failed attempts (labelled 0). These should form 2 well separated clusters of points, where each point is a multidimensional vector. This vector corresponds to the average values for each digraph/trigraph (in our case we find it sufficient to use only one of the two - we opted to use digraphs) generated during enrolment and any updates over time. The algorithm then uses a spatial metric (the Euclidean distance measure) to decide which class a login attempt belongs to. The algorithm employs both a training and a testing phase. During the training phase, we select a subset of members from each class to train the system (typically 70% of the total in each class, with an equal number of items from each class in the ideal case). We then select one of the remaining 30% of each of the two class samples and measure the distance between that example and the points in the clusters, finding the cluster to which the example is closest. We compare the label associated with the particular example and the label of the point it was closest to in the clusters. Code vectors that classify an example correctly are moved closer to it, and those that classify it incorrectly are moved further away. During this process, which generally takes 50 iterations, we end up with two very distinct classes that are extremely well separated. During the test phase, we simply use the geometric distance measure to determine which class the given input is closest to, and assign it to that class - which is either authentic or imposter.
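The training loop described above can be sketched as a minimal LVQ1 routine. This is an illustrative sketch, not the system as implemented: it uses one prototype (code vector) per class, a fixed learning rate, and toy two-dimensional digraph-time vectors.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train_lvq1(samples, labels, codebooks, code_labels, lr=0.1, epochs=50):
    """LVQ1: move the nearest code vector toward a sample it classifies
    correctly, and away from a sample it misclassifies."""
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            i = min(range(len(codebooks)), key=lambda k: dist(codebooks[k], x))
            sign = 1 if code_labels[i] == y else -1
            codebooks[i] = [c + sign * lr * (xi - c)
                            for c, xi in zip(codebooks[i], x)]
    return codebooks

def classify(x, codebooks, code_labels):
    """Assign the input to the class of its nearest code vector."""
    i = min(range(len(codebooks)), key=lambda k: dist(codebooks[k], x))
    return code_labels[i]

# Toy digraph-time vectors (ms): label 1 = authentic, 0 = failed attempts.
samples = [[100, 110], [105, 115], [98, 108], [200, 250], [210, 240], [190, 260]]
labels = [1, 1, 1, 0, 0, 0]
codebooks = [[120, 120], [180, 220]]  # one initial prototype per class
code_labels = [1, 0]
codebooks = train_lvq1(samples, labels, codebooks, code_labels)
print(classify([102, 112], codebooks, code_labels))  # → 1 (authentic)
```

After training, the two code vectors sit near their respective cluster centres, so classification reduces to a single nearest-prototype lookup - reflecting the minimal storage claim made for LVQ above.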
We also update the code vectors to maintain a more accurate estimate of the spatial position of the clusters - which in turn reflects the evolution of the user's login characteristics.
The incorporation of the LVQ algorithm enables us to use the speed and accuracy of machine learning techniques when we have sufficient data for failed login attempts. The algorithm is in general more flexible than statistical methods and in many cases more robust. The LVQ is a very fast algorithm which produces results comparable to other neural networks with a minimal amount of storage required to re-run the network on new samples. The results from the LVQ are again used in conjunction with other results to form a multimodal decision (or a hierarchical network of classifiers) about whether to authenticate a given login attempt.
In addition to the algorithms described above, we look at the historical context of a user's entry into the system. After enrolment, there are inevitably changes in the typing styles of users when they log into their accounts. We capture this variation by updating their reference vector as stated previously. In addition, we implement the Power Law of Practise to capture changes in typing style that reflect a learning and/or practise effect. This feature captures the effect of practise during the typing process - as users become more familiar with typing their login ID/password, they enter it more accurately, with less variation, and generally more quickly. The general equation for the Power Law of Practise is given in equation 4.
T = B·N^(-α) (eqn 4)
where T = current expected time to perform the task, B = time to perform the task initially, N = the number of times the task has been performed, and the exponent α is the learning rate.
Generally, one plots the log (base 10) of the time taken (T) against the log (base 10) of the ratio B/N. The slope of this line provides the alpha term - the quantitative estimate of the power of practise. In our implementation, we maintain records of the total login ID and password entry times for all entries (enrolment and otherwise). During enrolment, we measure the alpha for the owner of the login ID/password. Each time a successful login attempt occurs, we update this value. We then maintain the alpha value for the last login attempt and also record the number of successful login attempts for this login ID/password. We would expect the typing time to decrease slightly with familiarity/practise. When an imposter attempts to use someone else's account, they would be expected to be unfamiliar, or at least less familiar, with it than the legitimate owner. The time required to enter the login ID/password would generally be different from that of an experienced user of this login ID/password. This would alert the system that this is a potential imposter, and the user would be asked to re-input their details. If the speed of login ID/password entry does not match (i.e. the wrong value for alpha), this may indicate that an imposter is attempting to log in. It could also indicate that the legitimate owner made a mistake when they first typed in their details, so another chance is given. If the same mistake occurs again - i.e. the details do not match in terms of the Power Law of Practise calculation - then the user is assumed to be an imposter. Again, this information is combined with all other information available to make the decision about authenticating this login attempt. In addition, given that a user has three chances to log into the system, we check that the learning factor is incorporated into all 3 attempts. The consistency across these attempts, reflecting any issue of practise, will be evident in most cases.
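The estimation of alpha can be sketched as a least-squares fit in log-log space, under the assumption of equation 4 (T = B·N^(-α)). The synthetic timing data and function name below are for illustration only, not the recorded measurements of the system.

```python
import math

def practice_alpha(times):
    """Estimate the learning-rate exponent alpha from successive total
    entry times, assuming T = B * N ** (-alpha).  Taking logs gives
    log10(T) = log10(B) - alpha * log10(N), so alpha is minus the slope
    of a least-squares fit of log10(T) against log10(N)."""
    xs = [math.log10(n) for n in range(1, len(times) + 1)]
    ys = [math.log10(t) for t in times]
    n = len(times)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic entry times following T = 2.0 * N ** -0.3 seconds:
times = [2.0 * n ** -0.3 for n in range(1, 11)]
print(round(practice_alpha(times), 3))  # → 0.3
```

A legitimate owner's alpha would stay roughly stable across successful logins, while an unfamiliar imposter's timing history would yield a markedly different exponent - the mismatch the system flags above.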
One would expect that a user who is unfamiliar with a given login ID/password would produce a considerable amount of variation when typing in that login ID/password across multiple attempts (a maximum of 3 before lockout occurs). The authentic owner will also produce variation, but on a smaller scale with less inherent variation - this is the effect of practise. Authentication Example:
When a user logs into the system, the digraphs, trigraphs, keypress durations, and scancodes - the values associated with each character on the keyboard - are recorded using an assembly language routine. The reference vector stored for this login ID/password is then compared with the entry, and a series of authentication procedures are employed to determine, in a quantitative and automated fashion, whether the input was entered by the authentic owner of this login ID/password combination. This process is carried out in a multi-step fashion.
We first derive a set of values recording the statistical values for digraphs/trigraphs, the entropy measure and the multiple sequence alignments. We then use these features in a top-down approach, starting with the algorithm depicted at the top of the hierarchy (as illustrated in Figure 8). The particular algorithm(s) and/or their weight(s) are determined by the required level of security. If non-authentic samples are available, the LVQ algorithm can be employed - otherwise it cannot be used. In the top-end security mode, all algorithms are used and weighted equally, with the most stringent levels for parameters such as N in equation 1. This ensures a very secure and effective authentication algorithm that is more comprehensive than any reported in the literature. In order to scale the stringency of the authentication process, the values of various thresholds can be modulated accordingly. For instance, in equation 1, the value of N - the stringency factor - can be increased in order to relax the consistency requirements of a user's login details. This allows for a more flexible set of digraph/trigraph entries, so that users with variable input patterns are rejected by the system less often (reduced FRR) - at the expense of increasing the FAR slightly. The same holds true for other parameters in the model, such as the entropy thresholds. The system ensures dynamic flexibility by setting critical parameter values just above the measurements obtained during the enrolment process, modified through the course of a user's input history. If one maintains parameter values at the levels obtained through direct usage, without applying a margin, then one obtains a very tightly coupled system that works at the level of the user. This reflects the balance between FAR and FRR.
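The weighted combination of algorithm outputs might be sketched as a simple weighted vote. This is a hypothetical illustration only; the actual weights, thresholds, and the set of algorithms deployed depend on the configured security level as described above.

```python
def authenticate(votes, weights, threshold=0.5):
    """Combine per-algorithm accept (1) / reject (0) votes into a single
    weighted decision; weights can reflect the configured security level."""
    score = sum(v * w for v, w in zip(votes, weights)) / sum(weights)
    return score >= threshold

# Hypothetical votes from: statistics, entropy, AIS, power-law, LVQ.
votes = [1, 1, 1, 0, 1]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]  # top security mode: all weighted equally
print(authenticate(votes, weights))  # → True (score 0.8 ≥ 0.5)
```

Raising the threshold or the weight of a particular layer tightens the overall decision, mirroring the FAR/FRR trade-off discussed above.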
By combining these various levels of security, which are complementary and work at different levels, one can also tailor the security level to meet the patterns of the user. The AIS algorithm gathers information at a local level, highlighting features that are individualistic and may not be repeatable by an imposter. This level of security is less stringent than the lower level algorithm, which requires that all digraphs are within a specific range of the reference value obtained during enrolment. In essence, the stringency works in reverse order from the hierarchy. Yet these various elements provide complementary information that, when combined, provides an extremely robust authentication mechanism.
If this is the first attempt after enrolment, or there is no record of any failed login attempts for this account, then the LVQ algorithm is not employed, as it assumes that there are at least 50% as many non-authenticated login attempts as authenticated attempts. Entropy measures are recorded, together with the edit distance. The artificial immune system (AIS) is then employed to determine whether this attempt is sufficiently close to those found in enrolment - either the naïve enrolment (where only enrolment attempts have been entered) or one to which there has been a historical update of the login attempts. In the latter case, as a new valid entry is created, the oldest entry is replaced with the new entry to maintain a maximum of 10 entries for this login ID/password combination, and the relevant statistical measures are updated accordingly. The AIS algorithm selects features from the account data that are uniquely consistent - starting with single digraphs/trigraphs and looking for elements such as very short or very long durations, consistency between successive timing elements, etc. These elements describe what the self is - the characteristics that belong to the authentic user. When a login attempt is made, the antibodies that react with unique features are released on the input, and a recording is made of how much of the input is matched by pairing in a complementary fashion with the antibodies. These antibodies react with a pool of MHC molecules raised during the enrolment process. This process requires that we convert the input attributes into a discrete symbolic alphabet. In the default case, we use the amino acid alphabet, which provides 20 discrete ranges into which the input parameters are mapped. In order to map an input element (a digraph/trigraph), we divide the range of input values by the cardinality of the alphabet. We then assign each input element a value from the discrete symbolic alphabet and work with this representation.
The same transformation is applied to the input values when a login attempt occurs. We then allow the MHC molecules to bind to the transformed input, and then allow the antibodies to react with the activated MHC molecules (those that found matches in the input string). If all MHC sites on the antibodies are activated, the input is considered valid by this algorithm.
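The binding and activation steps just described can be sketched as substring matching over the discretised attempt. This is an illustrative simplification (binding treated as exact substring presence); the names and toy sequences are hypothetical.

```python
def activated_mhcs(mhcs, login_symbols):
    """An MHC 'binds' (becomes activated) if its conserved symbol run
    appears as a substring of the discretised login attempt."""
    return {m for m in mhcs if m in login_symbols}

def antibody_activated(antibody_sites, active):
    """An antibody fires only if every MHC binding site it carries
    found a match in the input."""
    return all(site in active for site in antibody_sites)

mhcs = ["ACD", "GW"]            # motifs raised during enrolment
attempt = "ACDFGW"              # discretised login attempt
active = activated_mhcs(mhcs, attempt)
print(antibody_activated(["ACD", "GW"], active))                          # → True
print(antibody_activated(["ACD", "GW"], activated_mhcs(mhcs, "ACDFGY")))  # → False
```

The second attempt fails because one MHC site finds no match, so the antibody cannot bind to all of its sites - the all-or-nothing condition stated above.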
At the next level, we calculate the edit distance and Shannon's entropy between the input and the enrolment set. The edit distance is calculated based on the number of swaps required to match the input sequence with the average reference value for that login ID. This is divided by the total number of swaps possible with an input sequence of a given length. If the value is greater than a given threshold (set to the average value of the enrolment sequence), then this input is marked as rejected by this algorithm; otherwise it is accepted. The information theoretic entropy is also measured, as per equation 3. If the inherent entropy of the system differs significantly from any particular element in the enrolment sequence or from the average value, then this algorithm votes to reject this input. Next, the absolute values of all digraphs/trigraphs/durations are compared with the reference values. If they differ beyond a threshold level, then the vote is to reject this input based on this algorithm. Lastly, we look at the history of a given login ID/password sequence through the Power Law of Practise. If the typing efficiencies for the login ID and password, considered separately, do not match the values recorded for this login ID/password sequence, then the vote is to reject this input. This feature reflects the power of practise and familiarity that a user develops when entering their login details repeatedly. By maintaining a constant historical record of the changes a user undergoes when entering their login details over time, we can maintain much tighter statistical measurements and related quantities (such as entropy measurements), which greatly enhances the discriminatory capacity of this system. This is indeed one area where the stringency of the system can be altered to the greatest effect.
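The entropy and edit-distance checks at this level can be sketched as follows. Shannon's entropy follows the standard formulation; the swap-based edit distance is approximated here by a positional mismatch count normalised by sequence length, purely as an illustration of a normalised distance in [0, 1].

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """H = -sum(p_i * log2(p_i)) over the observed symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def normalised_swap_distance(attempt, reference):
    """Count positions disagreeing with the reference sequence, normalised
    by sequence length (a simple stand-in for the swap-based edit distance
    described above)."""
    mismatches = sum(a != r for a, r in zip(attempt, reference))
    return mismatches / len(reference)

print(round(shannon_entropy("AABB"), 3))           # → 1.0 (two equiprobable symbols)
print(normalised_swap_distance("ACDGW", "ACDGY"))  # → 0.2 (1 mismatch in 5)
```

Either value exceeding its enrolment-derived threshold would contribute a reject vote, in line with the voting scheme described above.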
If we tighten the parameter values to accurately reflect changes in the user's typing style, then we can reduce the allowable variation in the keystroke dynamics employed by users entering their login details. This is, after all, the basis of this algorithm - to provide a tailor-made range of values for keystroke dynamics that are reproducible by the authentic user, but hard to emulate by an unauthorised user, within the constraint of 3 failed attempts before the system locks the account.
Advantages of the invention:
• is fast and very reliable with low FRR. Even though the system employs several hierarchical layers to perform the classification, the system classifies in real time, in under 1 second on a standard Pentium IV 1 GHz PC.
• The system as tested to date has a very low FAR (0.01%) and low FRR (0.00%) when tested in a heterogeneous group of subjects.
• may be easily adapted to work as an identification tool, given that all properly enrolled samples could be used via the neural network classifier. A Learning Vector Quantisation (LVQ) neural network is employed in our system that would be able to classify a given input into its proper class if the input is valid (i.e. belongs to an actual user whose details are stored in the database)
• works without the need for imposter samples
• evolves with user performance incorporating the effects of practise
• only requires a minimal set of training samples
• has truly user adjustable parameter settings and takes into account natural variability in typing styles
• its multi-tiered approach provides greater flexibility in the authentication process

Claims

1. A method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string during the login procedure, the method comprising: an enrolment process wherein a set of data is determined representing characteristics of key dynamics of the user, said characteristics comprising at least one of digraphs, trigraphs, and dwell times; and assessing the key dynamics of a user's login attempt by comparison with said set of data, comprising providing at least the following processes: a) a process comprising a statistical estimate of a user's login attempt; b) a process comprising an entropy estimate based on the degree of disorder in the login attempt; and c) a process comprising at least one machine learning technique where an assessment is made as to whether characteristic features of a collection of characteristic features are present in the login attempt; and depending on the level of security required, deploying any or all of processes a), b) and c) for assessing a login attempt.
2. A method according to claim 1, wherein said attributes comprise at least two of digraphs, trigraphs, and dwell times.
3. A method according to claim 2, wherein said attributes comprise all of digraphs, trigraphs, and dwell times.
4. A method according to any preceding claim, said method comprising updating said set of data when subsequent login attempts are deemed successful or at selected intervals, using data derived from login attempts.
5. A method according to any preceding claim, wherein in the enrolment process, said character string is entered a plurality of times, and a respective coefficient of variation is determined for time periods representative of each digraph or trigraph in the character string, and threshold values of entropy and consistency are computed from the determined coefficients of variation and stored as part of said set of data.
6. A method according to claim 5, wherein in said statistical estimate of process a), a selected digraph, trigraph, or dwell time is compared with upper and lower values, which are set depending on the corresponding said coefficient of variation.
7. A method according to claim 6, wherein said upper and lower values are set depending on a numerical stringency coefficient which is updated during use.
8. A method according to any preceding claim, wherein in said entropy estimate of process b), an edit distance is computed of an amount of rearrangement necessary of selected values of a login attempt to match with corresponding values of the enrolment process, and said edit distance is compared with a threshold value.
9. A method according to any preceding claim, wherein in said entropy estimate of process b), an entropy value is computed based on Shannon's formulation from selected values of a login attempt.
10. A method according to any preceding claim, wherein in an initial step of the enrolment process, and of subsequent login attempts, selected values of the user's keyboard entry are assigned to intervals in a range of values, and each interval is assigned a denoting symbol.
11. A method according to claim 10, wherein said selected values are digraph time values, and each said interval represents a time interval within said range.
12. A method according to claim 10 or 11, wherein, in the enrolment process, entropy values are computed according to Shannon's formulation, from a collection of enrolment login attempts, for each said symbol.
13. A method according to claim 12, wherein, in a login attempt, an entropy value is computed according to Shannon's formulation, and compared with the entropy values computed during the enrolment process.
14. A method according to any of claims 10 to 13, wherein a multiple sequence alignment process is deployed wherein repetitive motif patterns are determined in said selected values.
15. A method according to claim 14, wherein said machine learning technique comprises an artificial immune system algorithm, and said repetitive motif patterns are denoted as respective MHC molecules.
16. A method according to claim 15, wherein a respective antibody is determined for an antigen sequence of at least one MHC molecule, wherein the antibody is composed of a string of symbolic values related to the symbolic values of the or each MHC molecule (i.e. they are complementary to the MHC symbols).
17. A method according to claim 16, wherein, in a subsequent login attempt, the presence of said MHC molecules is determined, and a determination is made whether a said antibody binds to a said set of MHC molecules in the login attempt.
18. A method according to any preceding claim, wherein, when a plurality of invalid login attempts have been recorded, there is deployed a learning vector quantisation neural network algorithm in process c).
19. A method according to any preceding claim, wherein a Power Law of Practise is implemented as a logarithmic function as a further assessment of a login attempt.
20. A method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the method comprising: an enrolment process wherein user data is determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user, wherein in an initial step of the enrolment process, selected values of the user's keyboard entry are assigned to intervals in a range of values, and each interval is assigned a denoting symbol, and wherein repetitive motif patterns are determined in said denoting symbols by a multiple sequence alignment process; and assessing the key dynamics of a user's login attempt by comparing characteristics of the login attempt with said repetitive motif patterns.
21. A method according to claim 20, wherein said selected values are digraph time values, trigraph time values, or dwell time values, and each said interval represents a time interval within said range.
22. A method according to claim 20 or 21, wherein, in the enrolment process, entropy values are computed according to Shannon's formulation, from a collection of enrolment login attempts, for at least one said denoting symbol.
23. A method according to claim 22, wherein, in a login attempt, an entropy value is computed according to Shannon's formulation, and compared with the entropy values computed during the enrolment process.
24. A method according to any of claims 20 to 23, wherein said assessing the key dynamics of a user's login attempt comprises an artificial immune system process, and said extracted motif patterns are employed as inputs to the process.
25. A method of authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the method comprising: an enrolment process wherein user data is determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user; providing an Artificial Immune System process, and generating at least one string of data items, each string representing a sequence of characteristic features found in the dynamics of an enrolment login attempt, each string being denoted as an MHC molecule, and generating at least one antibody from a sequence of at least one MHC molecule, said antibody comprising a string of numerical values related to the numerical values of the or each MHC molecule; and assessing the key dynamics of a user's login attempt by locating MHC molecules therewithin, and applying said antibody to the located MHC molecules to assess a level of correlation therebetween.
26. A method according to claim 25, wherein said MHC molecules are determined by a multiple sequence alignment process, where selected features of said plurality of enrolment login attempts are arranged in rows and columns, repeated features in each column are determined, and an MHC molecule is defined as a sequence within a row of said repeated features.
27. A method according to claim 26, wherein an antibody is determined from a sequence of MHC molecules extending across a row, wholly or partially.
28. A method according to claim 26 or 27, wherein said selected features relate to digraph time period values, trigraph time period values, or dwell time period values, and a collection of consecutive digraph values within an enrolment login attempt is arranged in a said row.
29. A method as claimed in any of claims 25 to 28, wherein when a collection of invalid login attempts are collected, a classifying supervised neural network is deployed, that provides an assessment of a user's login attempt.
30. A method according to claim 29, wherein said neural network comprises a learning vector quantisation process.
31. Apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the apparatus comprising: an enrolment means storing a set of data representing characteristics of key dynamics of the user, said characteristics comprising at least one of digraphs, trigraphs, and dwell times; means for assessing the key dynamics of a user's login attempt by comparison with said set of data, comprising providing at least the following assessment means: a) an assessment means comprising a statistical means for estimating a user's login attempt; b) an assessment means comprising an entropy means for estimating a login attempt based on the degree of disorder in the login attempt; and c) at least one machine learning means for assessing whether characteristic features of a collection of characteristic features are present in the login attempt; and means for deploying any or all of means a), b) and c) for assessing a login attempt, depending on the level of security required.
32. Apparatus according to claim 31, wherein said enrolment means includes a respective coefficient of variation determined for time periods representative of each digraph, trigraph, and/or dwell time in the character string, and threshold values of entropy and consistency computed from the determined coefficients of variation.
33. Apparatus according to claim 32, wherein said statistical means includes means for comparing a selected digraph, trigraph, and/or dwell time with upper and lower values, which are set depending on the corresponding said coefficient of variation.
34. Apparatus according to claim 33, wherein said upper and lower values include a numerical stringency coefficient, and means for updating said numerical stringency coefficient during use.
35. Apparatus according to any of claims 31 to 34, wherein said entropy means includes means for computing an edit distance representing an amount of rearrangement necessary of selected values of a login attempt to match with corresponding values stored in the enrolment means, and means for comparing said edit distance with a threshold value.
36. Apparatus according to any of claims 31 to 35, wherein said entropy means includes means for computing an entropy value based on Shannon's formulation of selected values of a login attempt.
37. Apparatus according to any of claims 31 to 36, including bioinformatics means for assigning, in a login attempt, selected values of the user's keyboard entry to intervals in a range of values, wherein each interval is assigned a denoting symbol, and said enrolment means includes a table of denoting values determined from the enrolment process.
38. Apparatus according to claim 37, wherein said selected values are digraph time, trigraph time, or dwell time values, and each said interval represents a time interval within said range.
39. Apparatus according to claim 37 or 38, including means for computing, in a login attempt, entropy values according to Shannon's formulation, and means for comparing such values with the entropy values stored in the enrolment means.
40. Apparatus according to any of claims 37 to 39, including a multiple sequence alignment means for determining repetitive motif patterns in said selected values.
41. Apparatus according to claim 40, wherein said machine learning means comprises an artificial immune system means, wherein said repetitive motif patterns are denoted as respective MHC molecules.
42. Apparatus according to claim 41, wherein said artificial immune system means comprises means for determining a respective antibody for an antigen sequence of at least one MHC molecule, wherein the antibody is composed of a string of numerical values related to the numerical values of the or each MHC molecule.
43. Apparatus according to claim 42, including means for determining, in a subsequent login attempt, the presence of said MHC molecules and whether a said antibody binds to a said set of MHC molecules in the login attempt.
44. Apparatus according to any of claims 31 to 43, wherein said machine learning means comprises a learning vector quantisation neural network algorithm.
45. Apparatus according to any of claims 31 to 44, including means for providing a Power Law of practice, implemented as a logarithmic function, as a further assessment of a login attempt.
46. Apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the apparatus comprising: an enrolment means storing a set of data representing characteristics of key dynamics of the user; said user data including selected values of the user's keyboard entry assigned to intervals in a range of values, wherein each interval is assigned a denoting symbol, and repetitive motif patterns determined in said denoting symbols by a multiple sequence alignment process; and means for assessing the key dynamics of a user's login attempt by comparing characteristics of the login attempt with said repetitive motif patterns.
47. Apparatus according to claim 46, wherein said selected values are digraph time values, trigraph time values, dwell time values, and each said interval represents a time interval within said range.
48. Apparatus according to claim 46 or 47, wherein said user data includes entropy values computed according to Shannon's formulation from a collection of enrolment login attempts, for at least one said denoting symbol.
49. Apparatus according to claim 48, wherein said means for assessing includes means for computing entropy values according to Shannon's formulation, and for comparing such values with the entropy values stored in the enrolment means.
50. Apparatus according to any of claims 46 to 49, wherein said means for assessing the key dynamics of a user's login attempt comprises an artificial immune system means, wherein said repetitive motif patterns are employed as inputs to the process.
51. Apparatus for authenticating the user of a data processing device having a key device for entering characters, and requiring entry of a login string as a login procedure, the apparatus comprising: an enrolment means storing user data determined from a plurality of enrolment login attempts, representing characteristics of key dynamics of the user; an Artificial Immune System means for generating strings of data items, each string representing a sequence of characteristic features found in the dynamics of an enrolment login attempt, each string being denoted as an MHC molecule, and means for generating at least one antibody from a sequence of at least one MHC molecule, said antibody comprising a string of numerical values related to the numerical values of the or each MHC molecule; and means for assessing the key dynamics of a user's login attempt by locating MHC molecules therewithin, and applying said antibody to the located MHC molecules to assess a level of correlation therebetween.
52. Apparatus according to claim 51, including multiple sequence alignment means for determining said MHC molecules, wherein selected features of said plurality of enrolment login attempts are arranged in rows and columns, repeated features in each column are determined, and an MHC molecule is defined as a sequence within a row of said repeated features.
53. Apparatus according to claim 52, including antibody means for determining an antibody from a sequence of MHC molecules extending, wholly or partially, across a row.
54. Apparatus according to claim 52 or 53, wherein said selected features relate to digraph time period values.
55. Apparatus according to any of claims 51 to 54, including: classifying supervised neural network means for providing an assessment of a user's login attempt when a collection of invalid login attempts has been gathered.
56. Apparatus according to claim 55, wherein said neural network comprises a learning vector quantisation means.
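By way of illustration only (this sketch is not part of the claims), the statistical test of claims 33 and 34, upper and lower acceptance bounds around an enrolment mean scaled by the coefficient of variation and a numerical stringency coefficient, might be implemented along these lines; the function names and the symmetric-bound rule are assumptions, not taken from the patent:

```python
import statistics

def acceptance_bounds(samples, k=1.0):
    """Upper/lower bounds for a digraph, trigraph or dwell time, derived
    from enrolment samples.  k is a hypothetical stringency coefficient
    (claim 34 envisages updating it during use)."""
    mean = statistics.mean(samples)
    cv = statistics.stdev(samples) / mean  # coefficient of variation
    lower = mean * (1 - k * cv)
    upper = mean * (1 + k * cv)
    return lower, upper

def within_bounds(value, samples, k=1.0):
    """True if a login-attempt timing falls inside the acceptance band."""
    lo, hi = acceptance_bounds(samples, k)
    return lo <= value <= hi
```

A smaller k tightens the bounds; updating the stringency during use would amount to adjusting this single parameter per user.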
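The edit-distance measure of claim 35 (the amount of rearrangement needed to match a login attempt against stored enrolment values) is commonly computed as a Levenshtein distance. A minimal dynamic-programming sketch, assuming the timing values have already been discretised into symbol strings:

```python
def edit_distance(a, b):
    """Levenshtein distance between two symbol strings: the minimum number
    of insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]
```

The claim then compares this distance against a threshold value to accept or reject the attempt.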
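Claims 36 to 38 combine interval discretisation (each timing value mapped to a denoting symbol for its interval in the range) with an entropy value computed by Shannon's formulation. A minimal sketch, with illustrative bin edges and symbol alphabet:

```python
import math
from collections import Counter

def discretise(times, bin_edges, symbols="ABCDE"):
    """Map each timing value to the denoting symbol of its interval.
    bin_edges and the symbol alphabet are illustrative choices."""
    out = []
    for t in times:
        idx = sum(t >= e for e in bin_edges)  # index of containing interval
        out.append(symbols[min(idx, len(symbols) - 1)])
    return "".join(out)

def shannon_entropy(seq):
    """Shannon entropy H = -sum(p * log2(p)) over symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A login attempt's entropy would then be compared against the entropy values stored during enrolment (claims 39 and 49).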
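The multiple-sequence-alignment step of claims 40 and 52 looks for features repeated down each column of aligned enrolment attempts; runs of such repeated positions within a row then form the motif patterns. A toy column-consensus sketch (the min_support threshold is an assumption, not from the patent):

```python
from collections import Counter

def column_motifs(alignments, min_support=0.5):
    """Consensus over aligned enrolment symbol strings: a column whose most
    frequent symbol appears in at least min_support of the rows keeps that
    symbol; other columns are masked with '-'."""
    n_rows = len(alignments)
    consensus = []
    for col in zip(*alignments):
        sym, count = Counter(col).most_common(1)[0]
        consensus.append(sym if count / n_rows >= min_support else "-")
    return "".join(consensus)
```

Unmasked runs in the consensus would be the repetitive motifs that the claims go on to denote as MHC molecules.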
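The binding test of claim 43, determining whether an antibody (a string of numerical values) binds MHC molecules located in a login attempt, could be sketched as a per-position tolerance match; the tolerance rule and names are illustrative assumptions:

```python
def binds(antibody, mhc, tolerance=2):
    """True if every antibody value lies within a fixed tolerance of the
    corresponding MHC value.  A real system might instead score a level of
    correlation, as claim 51 suggests."""
    if len(antibody) != len(mhc):
        return False
    return all(abs(a - m) <= tolerance for a, m in zip(antibody, mhc))
```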
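Claim 45's Power Law of practice predicts that typing times shorten with repetition as T(n) = T1 * n^(-alpha), which is linear in log-log form (hence "implemented as a logarithmic function"). A sketch with an illustrative exponent, not a value from the patent:

```python
def predicted_time(t1, n, alpha=0.4):
    """Expected timing on the n-th repetition under the Power Law of
    practice; alpha is a hypothetical per-user learning-rate exponent."""
    return t1 * n ** (-alpha)
```

A login attempt whose timings fall far from the curve fitted during enrolment would count against authentication.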
PCT/GB2007/001274 2006-04-10 2007-04-05 Biometric security systems WO2007128975A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0607161A GB2437100A (en) 2006-04-10 2006-04-10 Biometric security system using keystroke dynamics of a user's login attempt
GB0607161.7 2006-04-10

Publications (2)

Publication Number Publication Date
WO2007128975A2 true WO2007128975A2 (en) 2007-11-15
WO2007128975A3 WO2007128975A3 (en) 2008-03-27

Family

ID=36539648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/001274 WO2007128975A2 (en) 2006-04-10 2007-04-05 Biometric security systems

Country Status (2)

Country Link
GB (1) GB2437100A (en)
WO (1) WO2007128975A2 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2689816C2 (en) 2017-11-21 2019-05-29 ООО "Группа АйБи" Method for classifying sequence of user actions (embodiments)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5557686A (en) * 1993-01-13 1996-09-17 University Of Alabama Method and apparatus for verification of a computer user's identification, based on keystroke characteristics

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2184576A (en) * 1985-12-23 1987-06-24 Saxe Frederick L Method and apparatus for verifying an individual's identity
DE19631484C1 (en) * 1996-08-03 1998-03-05 Dieter Bartmann Method for verifying the identity of a user of a data processing system to be operated with a keyboard for generating alphanumeric characters
US20040059950A1 (en) * 2002-09-24 2004-03-25 Bender Steven S. Key sequence rhythm recognition system and method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GUNETTI D ET AL: "KEYSTROKE ANALYSIS OF FREE TEXT" ACM TRANSACTIONS ON INFORMATION AND SYSTEM SECURITY, ACM, NEW YORK, NY, US, vol. 8, no. 3, August 2005 (2005-08), pages 312-347, XP001235597 ISSN: 1094-9224 *
KENNETH REVETT ET AL: "Enhancing Login Security Through the Use of Keystroke Input Dynamics" ADVANCES IN BIOMETRICS LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER-VERLAG, BE, vol. 3832, 2005, pages 661-667, XP019026943 ISBN: 3-540-31111-4 *
MARC EBNER ET AL: "On The Use Of Negative Selection In An Artificial Immune System" GECCO. PROCEEDINGS OF THE GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE. JOINT MEETING OF THE INTERNATIONAL CONFERENCE ON GENETIC ALGORITHM AND THE ANNUAL GENETIC PROGRAMMING CONFERENCE, XX, XX, 2002, pages 957-964, XP009094128 *
MONROSE F ET AL: "Password hardening based on keystroke dynamics" INTERNATIONAL JOURNAL OF INFORMATION SECURITY, SPRINGER, HEIDELBERG, DE, vol. 1, no. 2, 26 October 2001 (2001-10-26), pages 69-83, XP002315670 ISSN: 1615-5262 *
SAJJAD HAIDER: "A Multi-Technique Approach for User Identification through Keystroke Dynamics" [Online] 2000, pages 1336-1341, XP002447906 Retrieved from the Internet: URL:http://ieeexplore.ieee.org/iel5/7099/19153/00886039.pdf> [retrieved on 2007-08-21] *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10216786B2 (en) 2010-05-13 2019-02-26 Iomniscient Pty Ltd. Automatic identity enrolment
WO2012151680A1 (en) * 2011-05-10 2012-11-15 Agrafioti Foteini System and method for enabling continuous or instantaneous identity recognition based on physiological biometric signals
US9646261B2 (en) 2011-05-10 2017-05-09 Nymi Inc. Enabling continuous or instantaneous identity recognition of a large group of people based on physiological biometric signals obtained from members of a small group of people
US9472033B2 (en) 2013-07-25 2016-10-18 Nymi Inc. Preauthorized wearable biometric device, system and method for use thereof
US8994498B2 (en) 2013-07-25 2015-03-31 Bionym Inc. Preauthorized wearable biometric device, system and method for use thereof
US9189901B2 (en) 2013-07-25 2015-11-17 Nymi Inc. Preauthorized wearable biometric device, system and method for use thereof
US9349235B2 (en) 2013-07-25 2016-05-24 Nymi Inc. Preauthorized wearable biometric device, system and method for use thereof
US9705877B2 (en) 2014-07-07 2017-07-11 Oracle International Corporation Detecting sharing of passwords for password protected user accounts
US9363260B2 (en) * 2014-07-07 2016-06-07 Oracle International Corporation Detecting sharing of passwords for password protected user accounts
US9407634B2 (en) 2014-08-18 2016-08-02 Nymi Inc. Cryptographic protocol for portable devices
US9197414B1 (en) 2014-08-18 2015-11-24 Nymi Inc. Cryptographic protocol for portable devices
US9032501B1 (en) 2014-08-18 2015-05-12 Bionym Inc. Cryptographic protocol for portable devices
JP2018520397A (en) * 2015-04-21 2018-07-26 アリババ・グループ・ホールディング・リミテッドAlibaba Group Holding Limited Method and system for identifying human or machine
EP3503498A1 (en) * 2017-12-19 2019-06-26 The Boeing Company Method and system for vehicle cyber-attack event detection
US10659477B2 (en) 2017-12-19 2020-05-19 The Boeing Company Method and system for vehicle cyber-attack event detection
US11283819B2 (en) 2017-12-19 2022-03-22 The Boeing Company Method and system for vehicle cyber-attack event detection
US10970573B2 (en) 2018-04-27 2021-04-06 ID R&D, Inc. Method and system for free text keystroke biometric authentication
EP3842968A4 (en) * 2019-02-27 2022-03-16 "Group IB" Ltd. Method and system for identifying a user according to keystroke dynamics
CN116662933A (en) * 2023-07-26 2023-08-29 北京安图生物工程有限公司 Quality detection data processing method for detection kit
CN116662933B (en) * 2023-07-26 2023-10-13 北京安图生物工程有限公司 Quality detection data processing method for detection kit

Also Published As

Publication number Publication date
GB0607161D0 (en) 2006-05-17
GB2437100A (en) 2007-10-17
WO2007128975A3 (en) 2008-03-27

Similar Documents

Publication Publication Date Title
WO2007128975A2 (en) Biometric security systems
US8020005B2 (en) Method and apparatus for multi-model hybrid comparison system
Monrose et al. Keystroke dynamics as a biometric for authentication
US6591224B1 (en) Biometric score normalizer
Uludag et al. Biometric template selection and update: a case study in fingerprints
EP2477136B1 (en) Method for continuously verifying user identity via keystroke dynamics
Karnan et al. Biometric personal authentication using keystroke dynamics: A review
Jain et al. A multimodal biometric system using fingerprint, face and speech
US8997191B1 (en) Gradual template generation
US7869634B2 (en) Authentication of signatures using normalization of signature data
Revett A bioinformatics based approach to user authentication via keystroke dynamics
Davoudi et al. A new distance measure for free text keystroke authentication
Huang et al. Effect of data size on performance of free-text keystroke authentication
Fakhar et al. Fuzzy pattern recognition-based approach to biometric score fusion problem
Jain et al. Biometric template selection: a case study in fingerprints
CN116389114B (en) Static and dynamic identity consistency verification method and system
Revett et al. On the use of rough sets for user authentication via keystroke dynamics
Poh et al. A study of the effects of score normalisation prior to fusion in biometric authentication tasks
Revett A bioinformatics based approach to behavioural biometrics
Danlami et al. A framework for iris partial recognition based on Legendre wavelet filter
Guven et al. Enhanced password authentication through keystroke typing characteristics
Shaker et al. Keystroke dynamics authentication based on principal component analysis and neural network
Zamsheva et al. Person authentication with behaviosense using keystroke biometrics
Acien et al. On the analysis of keystroke recognition performance based on proprietary passwords
Bechtel et al. Passphrase authentication based on typing style through an ART 2 neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07732318

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07732318

Country of ref document: EP

Kind code of ref document: A2