GB2587000A

GB2587000A - Method of testing and improving security in a password-based authentication system

Info

Publication number: GB2587000A
Application number: GB1913125.9A
Authority: GB
Inventors: Daigniere Florent
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2021-03-17
Anticipated expiration: 2039-09-11
Also published as: GB2587000B; GB201913125D0

Abstract

A method and system are disclosed for testing passwords on a computer system against a list of compromised passwords. The system uses a probabilistic data structure, for example a Bloom filter, to check whether a password which a user is in the process of trying to set, or a password which is already set and stored in the system, is on a list of compromised passwords, for example passwords which have been stolen in a hacking attack against a different unrelated system. This is done without storing any information from which compromised passwords may be retrieved, and furthermore in a data structure which is small and fast, even for huge sets of compromised passwords.

Description

METHOD OF TESTING AND IMPROVING SECURITY IN A PASSWORD-BASED AUTHENTICATION SYSTEM

The present invention relates to a method of testing security in any computer system where passwords are used to authenticate users.

BACKGROUND TO THE INVENTION

Passwords remain the most common mechanism for authenticating users in a computer system, whether alone or as part of multi-factor authentication systems. However, it is well known that password protected systems are vulnerable to attacks where a hostile party is able, one way or another, to correctly guess a real user's password.

Attackers are typically able to guess users' passwords for a number of reasons. Firstly, users often choose passwords which are very easy to guess. For example "12345" and "password" are both extremely common passwords. In a system with a large number of users, an attacker has a reasonably good chance of gaining access by just trying a handful of very common passwords against every single user.

Even where a user tries to choose a hard-to-guess password, if a weakness in the system allows an attacker to test multiple passwords at high-speed, either "online" or "offline", then increasingly fast hardware together with the increased availability of highly-parallel processing across the internet (including via "botnets" of compromised machines) mean that millions of putative passwords can be tested every second. An exhaustive search can easily be performed, for example covering every possible password up to 8 letters long using lowercase English letters.

Over the past decades attempts to force users to choose more "secure" passwords have tended to focus on making this sort of exhaustive search more difficult by insisting that users include for example both uppercase and lowercase, and non-alphabetic characters in passwords. However, this approach alone is still highly vulnerable to poor choices made by users. Many short and insecure passwords can be made to comply with common "password complexity" requirements by capitalising the first letter and adding a "1" at the end, and/or perhaps a "!" character. Many users do exactly that, and indeed "Password1!" has been found to be a very commonly used password.

A more effective way to protect against this kind of "brute force" attack is to prevent more than, for example, three incorrect password entries before an account is locked out, to take measures to prevent password hashes being retrieved for use in offline brute force attacks, and to use suitably secure (which means suitably slow in both software and specialist hardware implementations, and also collision-resistant) cryptographic one-way hashing systems to reduce the probability that passwords can be guessed by brute force even if the hashes are discovered.

Where users do choose passwords which are hard-to-guess (by human or by machine) they can still be vulnerable if the same password is reused on different systems. A typical person who makes use of even a modest number of online services is likely to have dozens of online accounts. Most of these accounts are hopefully held with reputable online services with good security systems of their own, but if a user holds just one account with an online service which is hacked, and that service has vulnerabilities allowing an attacker to retrieve a plaintext password, then all of the user's accounts, on all online services, become highly vulnerable. An attack where known username / password combinations from one hacked service are tested against another service is known as "credential stuffing".

Credential stuffing can be a particularly damaging type of attack because even when users have chosen passwords which are difficult to guess, and even when services have implemented strong security measures to prevent an exhaustive search being carried out, a user's password might easily be guessed on the first attempt if the same password was used on a hacked system. "Highly complex" passwords might even be more vulnerable to this kind of attack, since the increased difficulty involved in remembering "complex" passwords means that users are tempted to use the same one in multiple systems.

To ensure that users are choosing secure credentials, it is now thought to be more important to ensure that users are not choosing passwords which are known to be compromised than it is to insist on "complexity". As a result, various systems have been proposed which test user passwords against a known list of "compromised passwords". In NIST Special Publication 800-63B, "Digital Identity Guidelines", the US Department of Commerce recommends that "verifiers SHALL compare the prospective secrets against a list that contains values known to be commonly-used, expected, or compromised".

However, there is a problem in implementing such systems which arises out of the sheer size of compromised password lists. Some current, filtered, cleaned and curated lists run to over 500 million passwords. The computational effort involved in testing whether a particular password is included in such a large list is considerable, and may be impractical for an organisation which wants to make periodic checks on, for example, thousands of stored user passwords.

Some entities offer a checking process against a large compromised password list as a service, so that system managers can send in passwords (or, more likely, hashes of passwords) or lists of passwords / password hashes to the service and get back a result indicating any which are on the compromised list. However, even with cryptographic hashing this potentially opens up a serious security vulnerability in itself, as the password or a hash of the password is sent to a third party. From the system manager's point of view this clearly increases the opportunity for the credential to be stolen.

Alternatively, a database of compromised passwords can be loaded onto the system which is being checked. Each password (when it is set) or each hash (while it is stored) can be checked against the compromised list, without sending them outside the system which is supposed to store and process them anyway. However, because of the very large size of the list, this represents a significant computational task. The machine which runs the authentication system might not be of a particularly high specification, because for a normal authentication workload it does not have to be. But testing every stored password against a 500-million long list may use significant processing time and, especially, memory. The better-designed such systems use a binary search, but the memory requirements are still likely to be very large compared with the resources available on typical authentication servers.

Many small-to-medium-sized businesses have a single machine which handles not only authentication but also other server functions, for example it may act as a file server, web server, DHCP server, DNS server, etc. A password checking workload relying on search against a very large list may significantly and adversely impact the other functions of the server, to an unacceptable extent.

It is an object of the present invention to provide a method of checking passwords against a compromised password list with significantly reduced processor and memory resources.

SUMMARY OF THE INVENTION

According to the present invention, there is provided a method of testing passwords on a computer system for presence on a list of compromised passwords, the method comprising the steps of: obtaining a list of compromised passwords, or a list of transformations of compromised passwords; building a probabilistic filter data structure corresponding to the list of compromised passwords or transformations, the probabilistic filter data structure not storing any item in the list of compromised passwords or transformations, and having a total size smaller than the total size of the list of compromised passwords or transformations, and the probabilistic filter data structure having the property that it is possible to test a candidate password or a transformation of a candidate password for its presence in the list of compromised passwords or transformations, and obtain a probabilistic output as to whether the candidate password is on the list or not; loading the probabilistic filter data structure onto the computer system to be tested; testing at least one candidate password or transformation of a candidate password on the computer system against the probabilistic filter data structure; and indicating whether the candidate password is probably on the list of compromised passwords, or probably not on the list of compromised passwords.

The invention has the advantage that the probabilistic data structure may be much smaller than the original list. For example, a reference file containing over 500 million compromised passwords has a compressed size of 8.75GB. A suitable probabilistic data structure may require about 10 bits per element, or about 625MB. More generally, the size of the probabilistic data structure may be much smaller than the minimum possible size of the original list of compromised passwords or transformations after lossless compression.

A data structure of 625MB corresponding to 500 million compromised passwords is easily small enough to be downloaded in a reasonable period of time on an average business internet connection, and more importantly it is small enough to be held in memory on most modern machines, even relatively low-specification machines and even machines which are required to concurrently perform other tasks.

Testing a candidate password against the probabilistic data structure may be done in a constant time (constant number of instructions and memory accesses), irrespective of the number of compromised passwords in the original list. This is another significant improvement over current products which tend to use a comparison-based binary search, which becomes increasingly time consuming as the log of the number of elements in the original list of compromised passwords.

A further advantage of the invention is that the probabilistic filter data structure does not store any single item in the list of compromised passwords, or transformations of compromised passwords. In other words, no information about the underlying compromised passwords can be retrieved from the probabilistic data structure. This is useful for several reasons. Firstly, there might be legal problems in some jurisdictions with distributing other people's credentials which have been compromised in a hacking attack. In some cases users may choose to include personal information in passwords and, although those details have already been compromised and are likely to be available on the internet, reputable businesses would not want to play a part in making them more widely available. More importantly, this property of the probabilistic data structure means that compromised passwords which are not, or may not be, publicly available may be included in the system to improve security, without potentially making a new security problem by making passwords public when they have not already been made public.

As an example, a reputable system operator might discover a vulnerability in their system which might have allowed passwords, or password hashes, to be stolen. However, they may not be able to tell if the passwords have actually been accessed by a malicious party or not. The responsible system operator would therefore not want to distribute passwords or hashes on the basis that they might have been compromised, because by doing so they would be making certain that they had been. However, the responsible system operator could create a probabilistic data structure which can be used to test candidate passwords to see if they are on the list of possibly compromised passwords or not.

Reputable system operators might even generate and distribute probabilistic data structures based on passwords used in their systems irrespective of any suspicion that data has been or might have been stolen. This means that users can be prevented from re-using passwords across different systems, which provides the best preemptive protection in case in the future one of those systems does become compromised.

The probabilistic output provided by the test against the filter is in its most general sense a probability distribution over two possible cases: "not on the list" or "on the list". In many embodiments, there may be a limited number of possible outputs. For example, the output from the filter might be binary, and the filter may be guaranteed not to return "false negative" matches but may have a small probability of returning a "false positive" match. In such a case, the two possible outputs from the filter are as follows:

First possible output: "Not on the list", i.e. 0% probability that the candidate password is on the list; 100% probability that the candidate password is not on the list.

Second possible output: "Probably on the list", i.e. the filter has indicated a positive hit, but the filter has a certain false positive rate and so this is not a definite indication.

A filter which guarantees against false negatives but allows a small number of false positives is particularly useful in this application, since any password which is on the compromised list will be rejected and thus all users are protected against the vulnerability inherent in choosing a compromised password. In most embodiments, the implication of a false positive will usually be that a user is forced to choose a different password, and a small number of false positives does not therefore cause a particularly large inconvenience.

It should be noted that the probability distribution over "on the list" and "not on the list" implied in the output of this kind of filter depends not only on the false positive rate of the filter but also on the underlying probability of a candidate password being on the compromised list. In a particular environment (perhaps a business or organisation where all users are very security-conscious and scrupulously do not re-use passwords), the underlying probability of a candidate password being on the compromised list might be very low, perhaps 1%. If the false positive rate of the filter is also 1% then the "probably on the list" output actually corresponds to only about a 50% probability that the candidate password is on the list. If the underlying probability of a candidate password being compromised is lower than the false positive rate of the filter, then the "probably on the list" output may actually correspond to less than a 50% probability that the candidate password is on the list.

Nevertheless, given the small inconvenience in having to choose a different password, such a probability distribution may be perfectly acceptable.
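By way of illustration only, the dependence of the "probably on the list" output on the underlying probability can be checked with Bayes' rule. The following is a minimal sketch, assuming a filter with no false negatives; the function name and the 20% prior in the second call are hypothetical values chosen for the example and do not come from the description.

```python
def posterior_on_list(prior: float, false_positive_rate: float) -> float:
    """P(password is on the list | filter says "probably on the list").

    Assumes the filter never returns a false negative, so a password
    which is on the list always produces a positive hit.
    """
    true_hits = prior                                   # listed and flagged
    false_hits = (1.0 - prior) * false_positive_rate    # not listed but flagged
    return true_hits / (true_hits + false_hits)


# 1% underlying probability, 1% false positive rate (the case discussed above)
print(round(posterior_on_list(0.01, 0.01), 4))  # 0.5025

# Hypothetical environment where 20% of candidate passwords are compromised
print(round(posterior_on_list(0.20, 0.01), 4))  # 0.9615
```

With a higher underlying probability, the same 1% false positive rate gives a much stronger indication that a flagged password really is on the list.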

It is thought that many common environments will have a significantly higher underlying probability of users attempting to choose compromised passwords, and although only an example, a 1% false positive rate is thought to be suitable in many embodiments. It will be appreciated that since the implication of a false positive is that a user is slightly inconvenienced by having to try again to create an acceptable password, and the implication of multiple false positives (say, two in a row) is that the user will be inconvenienced in this way twice, the false positive rate is actually a more useful metric of the usability of the system than the probability that a positive result is a "true positive".

An example of a suitable probabilistic data structure is a Bloom filter. This type of filter guarantees no false negatives, and the false positive rate can be tuned by selecting parameters. Just under 10 bits per element are required to achieve a 1% false positive rate, which brings even very large datasets well within the size of structure which can easily be loaded into memory in most authentication servers currently in use, including relatively low-specification and fairly outdated servers, and servers which have other functions running concurrently.

A further useful property of Bloom filters, and some other similar probabilistic filter data structures, is that multiple filters can be combined to create a filter corresponding to the union of two or more individual lists of passwords or transformations. With a Bloom filter, a new filter corresponding to the union of two sets is created by a bitwise OR operation on two filters each corresponding to one of the sets. This means that compromised password lists (where "compromised" in some embodiments might simply mean that the password has been used before, whether or not there is a chance that it could have been stolen) could be created by different organisations, each having access to confidential data sources including passwords. Each one of those organisations can create and share a Bloom filter, without sharing the confidential passwords. All of those Bloom filters may then be combined to create one filter, which tests against compromised passwords from multiple data providers, without any individual data provider having to share their confidential data with anyone else at all.

Other embodiments may use other types of probabilistic data structures which have the required properties, for example a cuckoo filter.

The invention may be used to test passwords either at the point when they are first set, or in the process of being set, by a user, or at a later stage while the passwords are stored on the system. Preferably, both of these are done. When a user first tries to set a password, the system will simply reject that password if it is found to be probably on the list of compromised passwords and require the user to try again with a different candidate password. Where the check is done at this stage, the system will know the plaintext password that the user tried to set, which was rejected as a result of appearing on the compromised list. The system may then store the rejected plaintext password and add it to a list of plaintext reject passwords which are checked in addition to the check against the probabilistic data structure. An advantage of keeping a plaintext reject password list is that "fuzzy matching" against the list can be used, for example a candidate password could be checked against similar passwords on the plaintext rejected password list using the Levenshtein Distance or a similar metric or approximate string-matching algorithm. This would avoid users whose password has been rejected trying again with a small permutation, for example adding a "1" to the end.
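The fuzzy matching described above might be sketched as follows. This is an illustrative example only: the function names, the sample passwords and the threshold of two edits are assumptions, and the description does not prescribe any particular threshold or implementation of the Levenshtein distance.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def too_close(candidate: str, reject_list: Iterable[str], max_distance: int = 2) -> bool:
    """True if the candidate is within max_distance edits of any rejected password."""
    return any(levenshtein(candidate, rejected) <= max_distance
               for rejected in reject_list)


# A user whose password "Summer2019!" was rejected tries again with a "1" appended.
print(too_close("Summer2019!1", ["Summer2019!"]))  # True
```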

A further plaintext reject list may be preconfigured in the system, for example listing organisation-specific words or acronyms. For example if used by the UK Intellectual Property Office, the reject list might include for example "EPOQUE", "PROSE", "PDAX", "Ipsum", etc. Fuzzy matching against this preconfigured word list attempts to prevent users from choosing obvious or likely-duplicate passwords, although such organisation-specific terms are unlikely to be included in general-purpose lists of compromised passwords.

The further plaintext reject list with organisation-specific words may be stored with, for example in the same file as, the plaintext reject password list. A single plaintext reject list may start with this preconfigured organisation-specific list when the system is first installed, and then grow as users' candidate passwords are rejected by the system for any reason.

The plaintext reject list may be normalised, for example by converting all letters to lowercase. Further normalisation rules may for example convert symbols and numbers to letters. E.g. "Pa$$w0rd" might be normalised to "password" before being stored in the plaintext reject list, using a lowercase conversion together with simple substitution rules for $ and 0. A candidate password may be normalised by the same set of rules before being fuzzy matched against the plaintext reject list. In some embodiments, both the normalised candidate password and the original candidate password are fuzzy matched against the plaintext reject list, and if either is found to be unacceptably close then the candidate password will be rejected.
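A minimal sketch of such a normalisation step is given below. The substitution table is an assumption made for illustration (it covers only the characters $, 0 and 4); a practical rule set would be chosen by the system operator.

```python
# Hypothetical substitution rules: symbol or digit -> letter
SUBSTITUTIONS = str.maketrans({"$": "s", "0": "o", "4": "a"})


def normalise(password: str) -> str:
    """Lowercase the password, then apply the simple substitution rules."""
    return password.lower().translate(SUBSTITUTIONS)


print(normalise("Pa$$w0rd"))  # password
```

Because the same function is applied both to entries stored in the reject list and to candidate passwords, variants of one base word collapse to a single entry.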

Normalising the reject list and checking normalised passwords keeps the size of the reject list reasonably small. This allows for the advantages of fuzzy matching, which is not possible against the probabilistic filter data structure, whilst keeping the reject list within a reasonable size. Because of the way the reject list is generated, from within the system being protected, it will naturally tend to stay within a size proportionate to the overall scale of the system (i.e. number of users). Because of the normalisation step, the reject list does not need to be expanded by generating all permutations, keeping the size reasonable.

The system of the invention alternatively, or preferably additionally, checks passwords which have been set some time ago and are stored in the system. This checking may be periodic, for example daily, weekly or monthly. Preferably, the checking is re-done every time a more up-to-date list of compromised passwords, and therefore a more up-to-date probabilistic data structure, is available, so that if a user's password becomes compromised action may be taken straight away.

Almost all well-designed systems store one-way cryptographically-secure hashes of passwords, rather than the passwords themselves. Hence the items being checked against the probabilistic data structure in this case are hashes of passwords (transformations of passwords). When the probabilistic filter data structure is built, the same transformation is carried out on each compromised password before it is added to the filter.

A system which checks passwords both when the password is set and periodically when the password is being stored still only needs to have one probabilistic filter data structure loaded. A plaintext candidate password which is in the process of being set just needs to be transformed (e.g. hashed) using the same transformation which will be used to store it, before being checked against the probabilistic data structure. At the same time the raw plaintext can be checked, for example using fuzzy matching, against the plaintext reject password list.

Where a periodic checking process identifies a stored password which is, or might be, on the compromised list, a number of follow-on actions might be taken, depending on the risk profile of the relevant system. At one end of the spectrum, the user might receive an email advising them to change their password as soon as possible. Alternatively, their account may be locked out immediately, forcing the user to seek assistance from an administrator to set a new password and unlock their account.

DESCRIPTION OF THE DRAWINGS

For a better understanding of the invention, preferred embodiments will now be described by way of example only, with reference to the accompanying drawings in which:

Figure 1 is a diagram showing how a probabilistic data structure may be built from a list of compromised passwords;

Figure 2 is a diagram showing how a single probabilistic data structure may be built from multiple probabilistic data structures, each of the multiple data structures corresponding to a list of compromised passwords; and

Figure 3 is a flowchart showing the process of interactively testing a user's password.

DESCRIPTION OF PREFERRED EMBODIMENTS

Referring firstly to Figure 1, the process of creating a probabilistic data structure from a list of compromised passwords 10 is illustrated. The list of compromised passwords might be obtained from a variety of sources, and might be a concatenation of multiple lists from multiple data sources. The passwords on the list are passwords which are known to be "insecure", for example because they have been obtained by hackers in one way or another and published.

Each password is transformed by a password hashing algorithm to obtain a list of transformed passwords 12. Examples of hashing algorithms include MD5, SHA-1 and bcrypt. The hashing algorithm should be chosen according to the hashing algorithm (if any) in use to store passwords in the system to be tested. For example, Active Directory stores passwords as several different types of hash to support different legacy clients, including storing passwords as an MD4 hash. Note that some of the listed hashing algorithms are now considered vulnerable and insecure.

However, if a target system uses one of those algorithms then that is the algorithm which must be used to create the transformed compromised password list 12. Some systems may store passwords without hashes at all, for example to allow only a subset of characters from a password to be requested on login (e.g. "please enter the 3rd, 5th and 6th characters from your password"). In this case, the compromised password list 10 does not need to be transformed before building a probabilistic data structure.
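The transformation step might be sketched as follows. This example assumes, purely for illustration, that the target system stores unsalted SHA-1 hashes of UTF-8 encoded passwords; the helper names are hypothetical, and the identity transformation covers the case where the target system stores passwords without hashing.

```python
import hashlib
from typing import Callable, Iterable, List


def sha1_hex(password: str) -> str:
    """Unsalted SHA-1 of the UTF-8 password (assumed target-system format)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def identity(password: str) -> str:
    """For target systems which store passwords without hashing."""
    return password


def transform_list(compromised: Iterable[str],
                   transform: Callable[[str], str]) -> List[str]:
    """Build the transformed password list 12 from the compromised list 10."""
    return [transform(p) for p in compromised]


print(transform_list(["password"], sha1_hex))
# ['5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8']
```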

In this embodiment, the probabilistic data structure is a Bloom filter. The Bloom filter is illustrated in Figure 1 as a bit array 14. It will be appreciated of course that the probabilistic data structure (whether a Bloom filter or anything else) will in practical embodiments be far longer than the 16 bit array shown in the illustration. To achieve a 1% false positive rate, just under 10 bits per item is required. It may be preferable to allocate more than 10 bits per item when the system is first built, so that if more compromised passwords are obtained they can be added to the Bloom filter without redefining the Bloom filter as a larger data structure and rebuilding it from scratch.

A Bloom filter is created by hashing the item to be inserted by k different hashing algorithms, each hashing algorithm mapping the item to be inserted to a position in the array 14. On initialisation of the Bloom filter, all elements in the array 14 are set to zero. When an element is added to the Bloom filter, the elements at the positions in the array 14 defined by the output of the k hashing algorithms (k = 2 in the Figure 1 example) are set to one. In the example shown, the first element added (d795a908 which is a hash by some algorithm of "password") is mapped to positions 1 and 10 by the two hashing algorithms used by the Bloom filter. I.e. H1(d795a908) = 1 and H2(d795a908) = 10. The second element added (e86f8957 which is a hash by the hashing algorithm of "12345") is mapped to positions 1 and 4 by the two hashing algorithms used by the Bloom filter. I.e. H1(e86f8957) = 1 and H2(e86f8957) = 4. Note that for these two example inputs, the output of hashing algorithm H1 happened to be the same. It is also a possibility that for a particular input, the output of hashing algorithm H1 might be the same as the output of hashing algorithm H2, in which case just one bit would be set by adding that particular element to the Bloom filter.
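The two insertions in this example can be traced with a short sketch. The hashing algorithms H1 and H2 are not specified in the description, so the positions are simply taken as given; the sketch also assumes, for illustration, that array positions are counted from zero.

```python
ARRAY_SIZE = 16  # the 16-bit array used in the illustration


def build_example_array():
    """Replay the two insertions described in the text and return the bit array."""
    bits = [0] * ARRAY_SIZE       # on initialisation, all elements are zero

    def add_at(positions):
        for position in positions:
            bits[position] = 1    # setting a bit which is already one has no effect

    add_at([1, 10])  # H1(d795a908) = 1, H2(d795a908) = 10
    add_at([1, 4])   # H1(e86f8957) = 1, H2(e86f8957) = 4
    return bits


print("".join(str(b) for b in build_example_array()))  # 0100100000100000
```

Only three bits are set after two insertions, because both elements map to position 1 under H1.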

To find an appropriate size m of the array 14 and an appropriate number of hash functions k, the following relationship may be used:

$$k = \frac{m}{n} \ln 2$$

where n is the estimated maximum number of elements to be added over the lifetime of the filter (which may be more than the number of currently-available elements to be added as soon as the filter is created).

m, in bits, may be set to about ten times n to achieve a false positive rate of 1%.
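The sizing can be checked numerically with the sketch below. The formula used for m is the standard Bloom filter sizing result for a target false positive rate p; it is not stated in the description and is included here as an assumption, as are the function and variable names.

```python
import math


def bloom_parameters(n: int, p: float):
    """Return (m, k): array size in bits and number of hash functions.

    m uses the standard sizing formula m = -n ln(p) / (ln 2)^2 (an assumption,
    not taken from the description); k follows from k = (m / n) ln 2.
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


n = 500_000_000                      # about 500 million compromised passwords
m, k = bloom_parameters(n, 0.01)     # target 1% false positive rate
print(f"bits per element: {m / n:.2f}")   # just under 10
print(f"hash functions k: {k}")           # 7
print(f"size at 10 bits per element: {n * 10 / 8 / 1e6:.0f} MB")  # 625 MB
```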

Note that once the Bloom filter 14 is created, it is impossible to retrieve any particular password from the filter, or even to test with any certainty whether a particular password was on the original compromised password list 10 (it is only possible to confirm if a particular password was not on the original list 10). It is therefore not a security concern if the hashing algorithm used to create the transformed password list 12 happens to be an insecure one (which it will need to be if the system to be tested uses an insecure hashing algorithm) or even if no transformed password list is generated at all. The compromised password list 10 and/or the transformed password list 12 do not need to be used further as part of the system. It may be desirable to keep the original password list 10 secret, or at least not to make it more readily available than it already is, and this system allows that to happen whilst taking advantage of the information in the compromised password list 10 to improve security in other systems.

Figure 2 shows how a Bloom filter 14 may be built to filter for items in three different compromised password lists 10a, 10b, 10c. The compromised password lists 10a, 10b, 10c may all be held by different entities, and each entity may wish not to share the contents of their own password list. In particular, this allows passwords which have possibly been compromised in an attack to be marked as "compromised" to other systems, without making those passwords publicly available (which in the eventuality that no compromise has actually happened, would make the problem worse).

Each entity holding a confidential compromised password list 10a, 10b, 10c creates its own Bloom filter 14a, 14b, 14c. In some embodiments, and as shown in Figure 2, this may involve first creating transformed password lists 12a, 12b, 12c.

Only the Bloom filters 14a, 14b, 14c need to be shared outside of the creating organisation, and no passwords can be retrieved from those Bloom filters. The Bloom filters 14a, 14b, 14c are combined by a bitwise OR to make a single Bloom filter 14 which filters for any element in any of the original compromised password lists 10a, 10b, 10c.

It will be noted that if there are duplicate passwords in the compromised password list 10 (as is fairly likely in a large system) or there are duplicates across different individual compromised password lists 10a, 10b, 10c, then this makes no difference to the eventual Bloom filter 14 which is created.
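The combination by bitwise OR, and the fact that duplicates make no difference, can be illustrated with the sketch below. It assumes that all filters share the same size and the same hashing algorithms; the array size, the value of k, the SHA-256 based position function and the sample lists are choices made for the example and are not taken from the description.

```python
import hashlib

M = 1024  # bits in each filter (all filters must share the same size)
K = 3     # number of hash positions per item


def positions(item: str):
    """Derive K array positions from one SHA-256 digest (double hashing)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def build_filter(items) -> int:
    """Return the filter as an integer whose binary digits are the bit array."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, item: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(item))


list_a = ["password", "12345"]
list_b = ["12345", "letmein"]   # "12345" is duplicated across lists
list_c = ["qwerty"]

combined = build_filter(list_a) | build_filter(list_b) | build_filter(list_c)

# The combined filter is identical to one built from the concatenated lists.
print(combined == build_filter(list_a + list_b + list_c))  # True
```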

When the Bloom filter is created, it is loaded on to systems which need to test user passwords to check if they are compromised. In a simple embodiment, this is a check of a list of stored passwords (or, in most embodiments, stored transformations, i.e. cryptographic hashes of passwords) against the Bloom filter, to see if that password might have been on a compromised list or not. Such a check can be run extremely quickly simply by calculating each hash function H1... Hk for the password to be checked, and testing whether all the bits in the array at the indices corresponding to the output of the k hash functions are set to one. If at least one of the bits at these positions is zero, then the password was definitely not on the compromised list 10 (or any of the compromised lists 10a, 10b, 10c). If all of the bits at these positions are one, then the password might have been on the compromised list 10 (or one or more of the compromised lists 10a, 10b, 10c).
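A check of stored hashes against a loaded filter might be sketched as follows. The class, the double-hashing scheme based on SHA-256, the filter parameters, the user names and the use of unsalted SHA-1 as the stored transformation are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)  # all bits start at zero

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Any zero bit means "definitely not on the list".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def sha1_hex(password: str) -> str:
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def flag_possibly_compromised(stored_hashes: dict, bloom: BloomFilter) -> list:
    """Return the users whose stored hash is possibly on the compromised list."""
    return [user for user, h in stored_hashes.items() if bloom.might_contain(h)]


bloom = BloomFilter(m_bits=10_000, k=7)
for compromised in ("password", "12345", "Password1!"):
    bloom.add(sha1_hex(compromised))

stored = {"alice": sha1_hex("password"), "bob": sha1_hex("k7#Vq9!zTw2@Lm")}
print(flag_possibly_compromised(stored, bloom))  # alice is always flagged
```

Each lookup computes k positions and reads k bits, regardless of how many compromised passwords were added to the filter.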

If a user's password on the system being tested is found to be possibly compromised, then different embodiments may take different actions. Ideally, for the best security, the affected user's account would be immediately locked out until a new password had been chosen.

In Figure 3, an interactive checking procedure is shown. This procedure may be carried out when a user chooses a password (i.e. when setting up their account for the first time, or when changing their password). The implication of carrying out the process when the password is first set is that at this stage the plaintext password is available to the system. In most systems, the plaintext password is not stored (rather, a hash is stored, hopefully a secure one), and so additional steps can be taken when the system has access to the plaintext password which are not possible in a periodic check against stored data. However, it will be appreciated that some systems which store passwords, for example with reversible encryption in a hardware security module, may be able to carry out some of these checks on stored data as well.

At step 20, a user enters his or her candidate password, i.e. the new password that the user wants to use to protect their account.

As an example, a user might choose as a password "MyP4$$wOrd".

At step 22, the password is normalised. This means carrying out substitutions according to stored rules. These rules may include, for example, converting all uppercase letters to lowercase, converting the numeral 0 to the letter o, etc. As an example, the user's entered password "MyP4$$wOrd" might be normalised to "mypassword".
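A normalisation routine of this kind can be sketched in a few lines of Python. Only the lowercase conversion and the 0-to-o rule are stated in the text; the 4-to-a and $-to-s rules are assumptions inferred from the worked example, and a real rule set would be chosen by the system operator.

```python
# Substitution rules applied after lowercasing. The "0" rule comes from the text;
# the "4" and "$" rules are assumed from the "MyP4$$wOrd" example.
SUBSTITUTIONS = str.maketrans({"0": "o", "4": "a", "$": "s"})

def normalise(password: str) -> str:
    """Lowercase the password, then apply the stored substitution rules."""
    return password.lower().translate(SUBSTITUTIONS)

example = normalise("MyP4$$wOrd")  # "mypassword"
```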

Step 22 further includes checking the normalised password ("mypassword" in the example) against a plaintext reject list. The plaintext reject list is simply a stored list of plaintext passwords which are considered to be bad or insecure. The plaintext reject list is ideally fairly short, and is not intended to duplicate the purpose of the Bloom filter. However, it might well be initialized with a short list (say a few hundred) of extremely common passwords, perhaps along with some organisation-specific common words.

Matching against the plaintext reject list is by a "fuzzy" or approximate matching algorithm. This, combined with the normalisation procedure, helps to keep the plaintext reject list reasonably short. For example, the plaintext reject list may well contain "password" on its list of extremely common passwords. The normalisation procedure means that "P4$$wOrd" does not need to be stored as well, and the fuzzy matching algorithm means that "MyP4$$wOrd" or "Password1!" do not need to be stored either.

In some embodiments, both the normalised password and the original raw password may be fuzzy matched against the plaintext reject list, and in the case that either one matches the password will be considered insecure and will be rejected.
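The description does not specify a particular approximate matching algorithm. As one possible illustration, the sketch below treats a candidate as matching a reject-list entry if the entry appears inside the candidate, or if the two differ by a small edit distance. The function names, the threshold of 2, and the choice of Levenshtein distance are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(candidate: str, reject_list, max_distance: int = 2) -> bool:
    """Match if an entry is contained in the candidate or is within max_distance edits."""
    return any(
        entry in candidate or edit_distance(candidate, entry) <= max_distance
        for entry in reject_list
    )

def is_rejected(raw: str, normalised: str, reject_list) -> bool:
    """Reject if either the raw or the normalised form fuzzy matches the list."""
    return fuzzy_match(raw, reject_list) or fuzzy_match(normalised, reject_list)

reject_list = ["password"]
result = is_rejected("MyP4$$wOrd", "mypassword", reject_list)  # True
```

In this example the raw form does not match on its own; it is the normalised form "mypassword" that contains the listed entry, which shows how normalisation and fuzzy matching work together.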

If the password is rejected then the flow returns to step 20 and the user is invited to try again with a different candidate password. In some embodiments, the unsuccessful candidate password which was rejected might be added to the plaintext reject list (step 24 in Figure 3). This might be advantageous in that it prevents the user incrementally moving away from the rejected password by adding characters until the candidate password is just different enough that it does not fuzzy match the password on the plaintext reject list. On the other hand, it will result in the size of the plaintext reject list growing perhaps faster than is desirable, and might represent a security risk in itself by storing in plaintext passwords which users have tried to use. At some point a user may succeed in setting a password which is just different enough from the one that was rejected not to be fuzzy matched, and in that case there will be a stored plaintext password which is quite similar to a password which is live and in use on the system.

Whether storing the rejected password in the plaintext reject list at step 24 is desirable will therefore depend a lot on the particular security profile of the system, and especially the other security measures which have been put in place to protect the stored plaintext reject list. Although called the "plaintext reject list", when stored on disc and not in use by a running process, the list may of course be encrypted in a reversible way, with a securely stored key.

In some embodiments the rejected password may be added to a temporary plaintext reject list, for example which only lasts as long as one user's session attempting to change their password. This may to some extent mitigate concerns about storing these plaintext passwords and about growing the plaintext reject list too quickly, while still preventing a user from setting a password very similar to one which has been rejected. A separate permanent plaintext reject list might be provided for very common passwords and/or organisation-specific words.
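One way to picture the temporary list is as state that lives only inside a password-change session and is discarded when the session ends. The following sketch is hypothetical throughout: the class and method names are invented for illustration, and simple substring containment on lowercased strings stands in for whatever fuzzy matching algorithm a real system would use.

```python
class PasswordChangeSession:
    """Holds a permanent reject list plus a temporary one scoped to this session."""

    def __init__(self, permanent_reject_list):
        self.permanent = tuple(permanent_reject_list)
        self.temporary = []

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Discard the rejected candidates as soon as the session ends.
        self.temporary.clear()
        return False

    def is_rejected(self, candidate: str) -> bool:
        # Substring containment is a stand-in for a real fuzzy matcher.
        lowered = candidate.lower()
        return any(entry in lowered for entry in (*self.permanent, *self.temporary))

    def record_rejection(self, candidate: str) -> None:
        self.temporary.append(candidate.lower())

with PasswordChangeSession(["password"]) as session:
    session.record_rejection("Hunter2")              # e.g. rejected by the Bloom filter
    blocked = session.is_rejected("hunter2!!")       # close variant of the rejected one
    allowed = not session.is_rejected("zebra-crossing-42")
leftover = list(session.temporary)                   # empty once the session has ended
```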

At step 26, the user's candidate password is hashed. This assumes that the system stores passwords as one-way cryptographic hashes. Hashing the password at step 26, before checking against the Bloom filter, allows just a single Bloom filter to be provided for both interactive checking when a user first sets a password, and for periodic checking of stored passwords. The hash used at step 26 is the same hashing algorithm used to securely store the user's password in the system. For example, some systems may use the bcrypt hashing algorithm.

In some embodiments both the user's original candidate password and a normalised password may be hashed for checking against the Bloom filter, e.g. bcrypt("mypassword") could be calculated as well as bcrypt("MyP4$$wOrd"), the outputs to be individually checked against the Bloom filter. This provides a level of protection against users not quite re-using exactly the same password as a compromised password, but using very similar passwords. If this is to be done then it should be taken into consideration when choosing the acceptable false positive rate of the Bloom filter (since the false positive rate of the overall check is essentially doubled if two different inputs are to be tested against the Bloom filter as part of checking one candidate password).
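The effect on the false positive rate can be made concrete with a short calculation. The sketch assumes that the two lookups behave as independent trials, each with the same per-lookup false positive rate; the function names are illustrative.

```python
def overall_false_positive_rate(per_lookup_rate: float, lookups: int = 2) -> float:
    """Chance that at least one of several independent lookups is a false positive."""
    return 1.0 - (1.0 - per_lookup_rate) ** lookups

def per_lookup_rate_for(target_overall: float, lookups: int = 2) -> float:
    """Per-lookup rate the filter must achieve to meet a target overall rate."""
    return 1.0 - (1.0 - target_overall) ** (1.0 / lookups)

doubled = overall_false_positive_rate(0.01)  # 0.0199, i.e. just under 2 x 0.01
needed = per_lookup_rate_for(0.01)           # roughly half of the 0.01 target
```

For small rates the overall figure is very close to twice the per-lookup rate, so a filter intended for two lookups per candidate would be sized for roughly half the false positive rate that would otherwise be acceptable.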

At step 28 the hash (or hashes, if both the normalised and raw candidate passwords are to be checked) is tested against the Bloom filter. In the general case, the output of a probabilistic data structure is a probability distribution over "on the list" and "not on the list". In the case of a Bloom filter there can be no false negatives but false positives are possible, so the output is either "not on the compromised list" or "possibly on the compromised list".

If the hash (or both hashes where both normalised and raw versions are checked) is not on the compromised list then at step 30 the password setting / changing procedure is completed. In most systems this is by storing the hash of the user's password in a database so that it can be used to authenticate future login attempts.

If the output of the check against the Bloom filter is that the candidate password (or, where it is also checked, the normalised candidate password) is possibly on a compromised password list, then the password is rejected and the user will have to choose a different candidate password. The rejected password in some embodiments might be added to a plaintext reject list (step 32). In some embodiments a normalised version of the password is stored in the plaintext reject list as well as, or instead of, the raw candidate password.
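The interactive procedure of Figure 3 can be summarised in a compact sketch. Several simplifications are assumed: unsalted SHA-256 stands in for the system's password hash (the text mentions bcrypt) so that the example needs no third-party package, substring containment stands in for fuzzy matching, and the filter parameters M and K and all function names are hypothetical.

```python
import hashlib

M, K = 4096, 4  # filter size in bits and number of hash functions (assumed)

def normalise(password: str) -> str:
    return password.lower().translate(str.maketrans({"0": "o", "4": "a", "$": "s"}))

def stored_hash(password: str) -> str:
    # Stand-in for the hash the system uses to store passwords.
    return hashlib.sha256(password.encode()).hexdigest()

def positions(hashed: str) -> list[int]:
    return [int(hashlib.sha256(f"{i}:{hashed}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def build_filter(compromised_hashes) -> int:
    bits = 0
    for h in compromised_hashes:
        for p in positions(h):
            bits |= 1 << p
    return bits

def try_set_password(raw: str, bloom_bits: int, reject_list: list[str]) -> str:
    norm = normalise(raw)
    # Step 22: check raw and normalised forms against the plaintext reject list.
    if any(bad in norm or bad in raw for bad in reject_list):
        return "rejected: reject list"
    # Steps 26 and 28: hash both forms and test each against the Bloom filter.
    for candidate in (raw, norm):
        if all((bloom_bits >> p) & 1 for p in positions(stored_hash(candidate))):
            reject_list.append(norm)  # step 32, optional in the description
            return "rejected: possibly compromised"
    # Step 30: a real system would now store stored_hash(raw).
    return "accepted"

bloom = build_filter([stored_hash("dragonfire")])
rejects = ["password"]
first = try_set_password("MyP4$$wOrd", bloom, rejects)   # caught at step 22
second = try_set_password("Dr4g0nFire", bloom, rejects)  # caught at step 28
```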

Storing a rejected password in the plaintext reject list (which is fuzzy matched to candidate passwords on subsequent attempts) helps to prevent the user from setting a password which is similar to, but not quite the same as, a password on the compromised passwords list. However, as discussed above, in some embodiments it will be considered undesirable to store rejected passwords in this way. This is particularly the case considering the possibility of false positives in the Bloom filter output, since a false positive hit against the Bloom filter does not mean that the input was anything at all close to an item on the original compromised passwords list. Again, rejected passwords at this stage could be added to a temporary plaintext reject list which lasts only as long as it takes the user to complete the process and succeed in setting a new password.

In some embodiments, a normalised version of a rejected password could be added to the Bloom filter. This does not store the password in a way which is ever retrievable but provides a level of protection against a user trying to set a very similar password (i.e. one which will be normalised to the same string) as one which was rejected as being on the compromised list.

It is not of course possible to fuzzy match against a Bloom filter but storing normalised passwords in the filter, and testing normalised passwords against the filter, allows for some class of similar passwords to be rejected.

In some embodiments the lists of compromised passwords 10, 10a, 10b, 10c might be normalised before being added to the filter in the first place, all testing against the filters being with normalised candidate passwords.
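The use of normalised strings inside the filter can be sketched as follows. For brevity the sketch inserts normalised plaintext directly, without the additional hashing transformation; the filter parameters, the normalisation rules beyond those stated in the text, and the example passwords are all assumptions.

```python
import hashlib

M, K = 2048, 3  # filter size in bits and number of hash functions (assumed)

def normalise(password: str) -> str:
    return password.lower().translate(str.maketrans({"0": "o", "4": "a", "$": "s"}))

def positions(item: str) -> list[int]:
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def add(bits: int, password: str) -> int:
    """Insert the normalised form of a password and return the updated filter."""
    for p in positions(normalise(password)):
        bits |= 1 << p
    return bits

def possibly_listed(bits: int, password: str) -> bool:
    """Test the normalised form of a candidate against the filter."""
    return all((bits >> p) & 1 for p in positions(normalise(password)))

# Build the filter from a normalised compromised list.
bits = 0
for leaked in ["Dragon2019", "letmein"]:
    bits = add(bits, leaked)

# A rejected candidate is added in normalised form; it cannot be read back out.
bits = add(bits, "Tr0ub4dor")

variant_of_leaked = possibly_listed(bits, "DRAGON2019")   # same normalised string
variant_of_rejected = possibly_listed(bits, "TROUB4D0R")  # same normalised string
```

Only variants that normalise to exactly the same string are caught in this way, which is the limited class of similar passwords that the filter can cover without true fuzzy matching.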

The invention allows system operators to audit and improve the security of their systems by giving assurance that users are not using passwords which are on known compromised lists. It also allows responsible system operators to collaborate to avoid password re-use, especially but not exclusively when there is a suspicion that credentials have been compromised, without sharing confidential data.

It will be appreciated that modifications and improvements may be made to the embodiments described, without departing from the scope of the invention as determined by the claims.

Claims

1. A method of testing passwords on a computer system for presence of compromised passwords, the method comprising the steps of: obtaining a list of compromised passwords, or a list of transformations of compromised passwords; building a probabilistic filter data structure corresponding to the list of compromised passwords or transformations, the probabilistic filter data structure not storing any item in the list of compromised passwords or transformations, and having a total size smaller than the total size of the list of compromised passwords or transformations, and the probabilistic filter data structure having the property that it is possible to test a candidate password or a transformation of a candidate password for its presence in the list of compromised passwords or transformations, and obtain a probabilistic output as to whether the candidate password is on the list or not; loading the probabilistic filter data structure onto the computer system to be tested; testing at least one candidate password or transformation of a candidate password on the computer system against the probabilistic filter data structure; and indicating whether the candidate password is probably on the list of compromised passwords, or probably not on the list of compromised passwords.
2. A method as claimed in claim 1, in which the output of the test against the probabilistic data structure is either that the candidate password is probably on the list of compromised passwords, or definitely not on the list of compromised passwords.
3. A method as claimed in claim 1 or claim 2, in which the probabilistic filter data structure is a Bloom filter.
4. A method as claimed in any of the preceding claims, in which multiple lists of compromised passwords or transformations of compromised passwords are obtained, and in which a probabilistic filter data structure is built for each of the lists of compromised passwords / transformations, and in which a single combined probabilistic data structure is created by combining the plurality of probabilistic filter data structures, and in which the single combined probabilistic filter data structure is loaded onto the computer system to be tested.
5. A method as claimed in any of the preceding claims, in which a password is tested as part of an interactive routine in which a user is setting or changing a password by choosing a candidate password.
6. A method as claimed in claim 5 in which, when output indicates that the password is probably on the list of compromised passwords, the candidate password is rejected and the user is invited to try another candidate password.
7. A method as claimed in claim 6 in which the rejected candidate password is stored on a plaintext reject list, and in which further candidate passwords chosen by the user are tested against the plaintext reject list by an approximate matching algorithm, and in which an approximate match against the plaintext reject list will result in the further candidate password being rejected.
8. A method as claimed in claim 7, in which the plaintext reject list is stored temporarily only for the duration of an interactive routine in which a single user is setting or changing a password.
9. A method as claimed in claim 7 or claim 8, in which rejected passwords are normalised by a normalisation algorithm before being inserted into the plaintext reject list, and in which candidate passwords are normalised before being tested against the plaintext reject list by the approximate matching algorithm.
10. A non-transient computer readable medium containing instructions which when executed on a processor carry out the method of any of claims 1 to 9.
11. A system for interactively testing passwords on a computer system for presence on a list of compromised passwords, the system comprising: a probabilistic filter data structure corresponding to the list of compromised passwords, the probabilistic filter data structure not storing any item in the list of compromised passwords or any transformation of any item in the list of compromised passwords, and having a total size smaller than the total size of the list of compromised passwords or transformations, and the probabilistic filter data structure having the property that it is possible to test a candidate password for its presence in the list of compromised passwords or transformations, and obtain probabilistic output as to whether the candidate password is on the list or not; a computer system loaded with the probabilistic filter data structure, and configured to: accept input of a candidate password from a user; test the candidate password for its presence in the list of compromised passwords by using the probabilistic filter data structure; and if the candidate password is probably not on the list of compromised passwords, accepting the candidate password and storing the candidate password on the computer system for future authentication of the user, or if the candidate password is probably on the list of compromised passwords, rejecting the candidate password and returning to the step of accepting input of a candidate password from the user.
12. A system as claimed in claim 11, in which if the candidate password is rejected, the candidate password is stored in a plaintext reject list.
13. A system as claimed in claim 12, in which the candidate password is tested by an approximate matching algorithm against the plaintext reject list, and in the case that the candidate password approximately matches any element on the plaintext reject list, rejecting the candidate password irrespective of the result of any test against the probabilistic data structure.
14. A system as claimed in claim 12 or claim 13, in which the plaintext reject list is stored only for the duration of an interactive session of a single user, the session ending when a candidate password is accepted.
15. A system as claimed in any of claims 12 to 14, in which a rejected password is normalised by a normalisation algorithm before it is stored in the plaintext reject list, and in which a candidate password is normalised before it is tested by the approximate matching algorithm against the plaintext reject list.
16. A system as claimed in any of claims 11 to 15, in which the output of the test against the probabilistic data structure is either that the candidate password is probably on the list of compromised passwords, or definitely not on the list of compromised passwords.
17. A system as claimed in claim 16, in which the probabilistic filter data structure is a Bloom filter.