GB2425855A

GB2425855A - Detecting and filtering of spam emails

Info

Publication number: GB2425855A
Application number: GB0508296A
Authority: GB
Inventors: Martin Giles Lee
Original assignee: MessageLabs Ltd
Current assignee: MessageLabs Ltd
Priority date: 2005-04-25
Filing date: 2005-04-25
Publication date: 2006-11-08
Also published as: GB0508296D0

Abstract

A system for detecting whether an unknown email (201) is spam by testing whether the email is similar to previously encountered spam emails. A rapid first-pass filtering by a heuristics engine (102) is followed by a rapid search, by a signature filter (103), for signatures previously identified in clusters of related spam emails. Only emails which pass the heuristics engine (102) and are not detected by the signature filter (103) are subjected to a computationally intensive calculation by the classifier (104) of shared information between features of the unknown email (201) and features from previously encountered emails. The signatures are generated by the metadata generator (106) by examining common metadata features between known spam emails (101) that are found to be so closely related, through calculation of shared information by the classifier (104), that they can be clustered together and stored in a store of email clusters (107). The metadata generator (106) passes the identified signatures to the signature filter (103), and metadata features that best represent clusters of similar emails are passed to the exported metadata store (105).

Description

A METHOD AND SYSTEM TO IMPROVE SPEED OF DETECTION OF RELATED SPAM EMAILS.

The present invention relates to a method of increasing the speed of spam detection by a computationally intensive measure of shared information.

Spam email causes increasing nuisance by flooding recipients' email inboxes with unwanted messages. Frequently the spam contains fraudulent or explicit content and may cause distress or financial loss. The time spent dealing with these messages, the resources required to store and process them on an email system, and wasted network resources can be a significant waste of money. Numerous measures have been proposed to detect spam. However, spammers have reacted by disguising their emails in an attempt to thwart spam detection measures.

Many of these attempts by spammers to disguise their emails simply mean that the emails are easier to spot by other methods. One particular class of spam which has been found to be difficult to detect using current methods is the advanced fee fraud email, otherwise generally known as the Nigerian 419 scam.

These emails are frequently sent from free email accounts and appear to be typed in manually. As such they strongly resemble standard email communication and are distinguishable only by their subject domain. While the presence of a few key phrases is an indicator of an email being an advanced fee fraud, this is far from an absolute indicator and cannot be used to identify such emails with certainty.

Manual inspection of these emails quickly identifies that many of these emails share certain details. Frequently there is a clear and strong relation between many of these emails suggesting that they were written by the same individual or group of individuals. The exact phrasing tends to change between each message, and from day to day, rendering detection by current methods problematic.

Manual examination of emails that appear to be related according to their similar content and wording, shows that there are some features of these emails which are strongly conserved within the group. Identification of these common features in a new email allows a rapid assignment of the email to a particular group. Otherwise emails can be sorted into groups by a more computationally intensive method by calculating the amount of shared information which is observed between the members of previously identified groups of emails and a new email.

According to the present invention there is provided a method of anti-spam filtering of emails comprising the steps, executed by machine, of: a) selecting an email for processing; b) calculating the value of an indicator of the information shared between the email and reference data from a reference corpus of spam emails; and c) selectively flagging the email as spam depending on the value of the indicator calculated by step b).

The invention also provides a system of anti-spam filtering of emails comprising: a) means for selecting an email for processing; b) means for calculating the value of an indicator of the information shared between the email and reference data from a reference corpus of spam emails; and c) means for selectively flagging the email as spam depending on the value of the indicator calculated by means b).

Other, optional, features of the invention are defined in the sub claims.

The invention is based on the concept of shared information in an objectively measurable, information-theoretic sense.

The paper "Shared Information and Program Plagiarism Detection" by Chen, Francia et al., University of California, Santa Barbara, December 13, 2003 (http://monod.uwaterloo.ca/papers/04sid.pdf) describes the application of shared information measurement to the detection of plagiarism of program source code. This measurement is based on the concept of Kolmogorov complexity. The Kolmogorov complexity K(x) of a string x is the length, in number of bits, of the shortest program that, with no input, can generate x. Although Kolmogorov complexity is not computable, as described in the paper it is possible to derive a metric of Information Distance between two strings x and y based on the compressed sizes of the strings x and y and the compressed size of the concatenation of the strings; the more similar x and y are, the smaller the calculated distance. The present invention preferably uses a normalised version of this metric as its measure of shared information.

From a practical point of view, the manner in which the calculating step b) is carried out is as follows:

A store of reference data is established, containing data characterising each of a number of clusters of email in the reference corpus, the members of a given cluster having previously been established to contain shared information. The reference data conveniently comprises the values of one or more strings which are selected in accordance with predetermined criteria. The criteria specify types of strings of interest which are characteristic of scam spam email, for example:

The contents of the Subject field
Email addresses
Capitalised strings
Numerical strings
The entire email body

The string values stored as the reference data need not all be from the same email in the cluster. Preferably, for each string, the value stored in the reference data is that one which is most representative of the cluster, based on a measure of information shared between that value and the values of corresponding strings of other members of the cluster.

Thus, with the reference data established in the above described way, the calculation step b) can proceed on the basis of measuring the information shared between strings in the reference data and corresponding strings in the email under consideration.

Once the reference data has been established, the emails from which it has been derived could, in principle, be discarded. However, it is preferable to retain them so that the reference data can be dynamically regenerated, since the contents of spam emails, scam spam emails in particular, change over time. This dynamic regeneration can be done by executing a training algorithm which seeks to determine to which of the existing clusters an email known to be spam belongs. If it does not belong to any existing cluster, a new cluster may be created with that email as its (initially) sole member; the necessary reference data is, of course, generated from the new cluster.

The training algorithm preferably evaluates whether more than a predetermined proportion of the members of a cluster have the same value for one of the strings of interest. If they do, that value can be used as a signature for the cluster and be stored in a signature store. This signature store can be used both by the training subsystem and the detection subsystem to prefilter emails submitted to them. If the email is found to contain one of the signatures stored in the signature store, it can be flagged as spam without the need for the computationally intensive shared information calculations. When a signature is added to the signature store, the reference data from the cluster from which it is derived can be discarded, so that there is, overall, less reference data against which an email needs to be evaluated.

The invention will be further described by way of non-limitative example with reference to the accompanying drawings, in which:- Figure 1 is a block diagram showing the training subsystem of one embodiment of the invention; and Figure 2 is a block diagram showing the detection subsystem of that embodiment of the invention.

Figure 1 illustrates the training sub-system of one embodiment of the present invention.

Known spam 101 is passed to the heuristic engine 102, which searches for the presence of key words and phrases within the known spam 101. If the spam does not meet predefined criteria for the presence of certain key words or phrases, the known spam 101 is rejected and no further processing takes place; otherwise, the known spam 101 is passed to the signature filter 103.

The signature filter 103 tests for the presence of character strings known to be found in existing clusters of known spam in the store of email clusters 107. If the signature filter 103 detects the presence of a known signature string, the known spam 101 is added to the cluster from which this string was derived, and stored in the store of email clusters 107. If a known signature string is not found the email is passed to the classifier 104.

The classifier 104 extracts metadata from the known spam 101 and compares the metadata from the spam with the metadata taken from clusters of related known spam emails stored in the exported metadata store 105. The comparison is performed by calculating the amount of information shared between each of the components of metadata identified in the known spam 101 and components of metadata of the same type contained in the exported metadata store 105 by a compression based algorithm subsequently described.

In passing, it should be noted that throughout this specification and claims, the expressions "metadata" and "reference data" are used interchangeably.

If the known spam 101 is found to be similar to the metadata relating to an existing cluster, the spam is added to this cluster in the store of email clusters 107. If the known spam 101 is not found to be similar to the metadata of any existing cluster, a new cluster containing this spam as the sole member is created in the store of email clusters 107.
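The assignment of a known spam email to a cluster, as described in the preceding paragraphs, might be sketched in Python as follows. This is an illustrative sketch only: the function name `assign_to_cluster`, the `is_similar` callback (standing in for the shared information comparison of the classifier 104) and the cluster naming scheme are assumptions and do not appear in the specification.

```python
from typing import Callable, Dict, List


def assign_to_cluster(
    email: str,
    clusters: Dict[str, List[str]],
    signatures: Dict[str, str],
    is_similar: Callable[[str, List[str]], bool],
) -> str:
    """Place a known spam email in a cluster and return the cluster name.

    `signatures` maps a signature string to the name of its cluster.
    `is_similar` is a hypothetical stand-in for the classifier comparison.
    """
    # Signature filter: a known signature string decides membership directly.
    for signature, cluster_name in signatures.items():
        if signature in email:
            clusters[cluster_name].append(email)
            return cluster_name

    # Classifier: compare only against clusters that have no signature,
    # since metadata for signed clusters is not kept for comparison.
    signed = set(signatures.values())
    for cluster_name, members in clusters.items():
        if cluster_name not in signed and is_similar(email, members):
            members.append(email)
            return cluster_name

    # No match: the email becomes the sole member of a new cluster.
    new_name = f"cluster-{len(clusters) + 1}"
    clusters[new_name] = [email]
    return new_name
```

The order of the checks mirrors the text: the inexpensive signature test comes first and the costlier similarity test is reached only when no signature matches.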

Following the creation of a new cluster or the addition of a new email to an existing cluster in the store of email clusters 107 the metadata generator 106 may recalculate the metadata for that cluster. If a cluster contains a large number of emails, the metadata generator 106 may skip the recalculation of metadata for the addition of a single email.

The metadata generator 106 identifies strings of digit characters in the bodies of individual emails (i.e. the known spam email 101 and the members of the cluster under consideration); the digits may be contiguous or separated by non-alphabetic characters. Additional processing may impose restrictions on the digit strings, e.g. they must be more than a certain length, and no more than a certain proportion of the digits may be '0'. If more than a predefined proportion of the emails contain an identical string, this string is passed to the signature filter 103 for use as a signature which is taken to characterise membership of the cluster. Otherwise, the amount of shared information between each of the strings is calculated by a compression based algorithm subsequently described. The string which shows the highest amount of shared information with the corresponding strings of the other emails of the cluster is exported to the exported metadata store 105 as the most representative of the digit character strings for the cluster.
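The digit string extraction described above could be approximated with a regular expression, as in the Python sketch below. The function name, the minimum of 7 digits and the 50% limit on zeros are assumed values chosen for illustration; the specification leaves these thresholds open.

```python
import re
from typing import List

# A digit, followed by further digits that may be separated by runs of
# non-alphabetic characters on the same line.
_DIGIT_RUN = re.compile(r"\d(?:[^A-Za-z\n]*\d)+")


def extract_digit_strings(
    body: str, min_digits: int = 7, max_zero_fraction: float = 0.5
) -> List[str]:
    """Return candidate digit strings found in an email body."""
    found = []
    for match in _DIGIT_RUN.finditer(body):
        digits = [ch for ch in match.group() if ch.isdigit()]
        if len(digits) < min_digits:
            continue  # too short to be characteristic
        if digits.count("0") / len(digits) > max_zero_fraction:
            continue  # mostly zeros, e.g. a rounded sum of money
        found.append(match.group())
    return found
```

With these assumed thresholds, a telephone number such as `228-9116578` is kept, while a rounded amount such as `12,500.000.00` is discarded because most of its digits are zero.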

If a signature has not been created for the cluster, the metadata generator 106 examines email addresses contained within the body of the known spam 101, as well as email addresses in the reply-to and from header fields. As with strings of digits, if more than a defined proportion of the emails in a cluster contain an identical email address, this is passed to the signature filter 103. Otherwise the most representative email address is calculated as for digit strings and passed to the exported metadata store 105.

Again, if a signature has not been created for the cluster, the metadata generator 106 examines the value of the subject header field of the known spam 101. If more than a defined proportion of the emails in a cluster contain an identical value for the piece of metadata, this is passed to the signature filter 103. Otherwise the most representative subject value is calculated and passed to the exported metadata store 105.

Similarly, in the absence of a signature having been created, contiguous strings of capitalised letters are identified within the body of the known spam 101 by the metadata generator 106. Only the first 5 such strings of less than one line in length are considered. If an identical string occurs in a predefined proportion of the emails in a cluster, this string is declared a signature and passed to the signature filter 103. If this is not the case, the most representative capitalised string is calculated and passed to the exported metadata store 105.
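One possible way of picking out capitalised strings is sketched below in Python. The exact definition of a "contiguous string of capitalised letters" is not given in the text, so the regular expression (runs of two or more capital letters joined by spaces or light punctuation, within a single line) is an assumption, as is the function name.

```python
import re
from itertools import islice
from typing import List

# Words of two or more capital letters, optionally continued by further
# capitalised words joined by spaces or light punctuation.
_CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:[ ,.&-]+[A-Z]{2,})*")


def extract_capitalised_strings(body: str, limit: int = 5) -> List[str]:
    """Return the first `limit` capitalised strings, searching line by line."""
    matches = (
        match.group()
        for line in body.splitlines()
        for match in _CAPS_RUN.finditer(line)
    )
    return list(islice(matches, limit))
```

Searching line by line keeps every result shorter than one line, and `limit=5` reflects the restriction to the first 5 strings.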

If no signatures are created, the bodies of the emails in the cluster are compared against each other and the most representative body within the cluster is passed to the exported metadata store 105 by the metadata generator 106.

If a signature was created during this process, the metadata generator 106 may instruct the exported metadata store 105 to remove all metadata relating to this cluster.

Figure 2 shows the detection sub-system of the illustrated embodiment of the present invention.

An email 201 whose status as spam or not is unknown is inputted to the detection system from an email stream (not illustrated) which may, for example, be the series of emails arriving at a given node in a local-area or wide-area network.

The unknown email 201 is passed to the heuristic engine 102. If the email does not meet predefined criteria for the presence of certain key words or phrases, the unknown email 201 is rejected by the system with no opinion as to whether the email is spam or not.

Otherwise, the unknown email 201 is passed to the signature filter 103. If the signature filter 103 detects the presence of any signature known to it in the unknown email 201, the email is identified as spam and a signal sent to the spam output 203. If no signature is detected, the unknown email 201 is passed to the classifier 104.

The classifier 104 identifies various items of metadata within the unknown email 201 and compares these with items of metadata from clusters of related known spam emails held in the exported metadata store 105. For each item of metadata contained in the unknown email 201, the shared information distance between that item and the metadata of the same type for each cluster stored in the exported metadata store 105 is calculated according to a compression based algorithm subsequently described. Metadata types may be email addresses, number strings, capitalised strings, the subject header value, the entire body of the email etc. If the metadata of the unknown email 201 is found to share a significant amount of information with metadata derived from a cluster, a counter for that cluster is incremented. The increment may be modified to give more importance to certain types of metadata. If the value of the counter for any cluster passes a predefined level, the unknown email 201 is declared to be substantially similar to the members of the cluster and as such is spam; a signal is sent to the spam output 203 and no further processing takes place.

In response to the signal at spam output 203 that an email is spam, remedial action may be taken; this may involve deleting the email, not forwarding it to its addressee or moving it to a special folder, e.g. on an email client.

If, after completion of calculations by the classifier 104, no counter values have passed the predefined level, the unknown email 201 is found not to be substantially similar to any previously encountered spam; the unknown email 201 is declared as non-spam and a signal is sent to the non-spam output 204.
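The three stages of the detection sub-system can be summarised in a short Python sketch. The function name `detect`, the rule that a single key phrase is enough to pass the heuristic engine, and the `classifier_says_spam` callback are assumptions made for illustration; the specification does not fix the heuristic criteria.

```python
from typing import Callable, Iterable


def detect(
    email: str,
    key_phrases: Iterable[str],
    signatures: Iterable[str],
    classifier_says_spam: Callable[[str], bool],
) -> str:
    """Return "no opinion", "spam" or "non-spam" for an unknown email."""
    text = email.lower()

    # Heuristic engine: without any key phrase the system gives no opinion.
    if not any(phrase.lower() in text for phrase in key_phrases):
        return "no opinion"

    # Signature filter: any known signature identifies the email as spam
    # without invoking the costly classifier.
    if any(signature in email for signature in signatures):
        return "spam"

    # Classifier: the computationally intensive comparison runs last.
    return "spam" if classifier_says_spam(email) else "non-spam"
```

Ordering the stages from cheapest to most expensive is what allows most emails to be handled without reaching the classifier.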

Substantial increases in the speed of processing unknown emails 201 by the detection sub-system 200 are made by:

The identification of clusters of related known spam 101 emails, allowing the identification of frequently encountered metadata within the cluster which can be used as signatures by the signature filter 103.

The inclusion of the signature filter 103 in the detection sub-system 200, which identifies known spam early in the process before passing the email to the classifier 104.

The passing to the exported metadata store 105, by the metadata generator 106, of metadata relating only to clusters of related spams that cannot be identified by signatures, which means the classifier 104 has fewer items of metadata to compare with each incoming unknown email 201. This results in fewer time-consuming calculations being made.

Description of Algorithm to Calculate Shared Information by Compression.

The algorithm is a modified version of that used to calculate lexical distance in the Chen paper mentioned above based on the concept of calculating algorithmic entropy also known as Kolmogorov complexity.

The general equation is:

$$d(x,y) = 1 - \frac{K(x) - K(x \mid y)}{K(xy)}$$

where d(x,y) is the shared information distance between x and y.

K(x) is the algorithmic entropy or Kolmogorov complexity of x.

K(xy) is the algorithmic entropy of the concatenated strings x and y.

K(x) - K(x|y), where K(x|y) is the algorithmic entropy of x given y, is the amount of information y has about x.

This can be shown to resolve to:

$$d(x,y) = 1 - \frac{K(x) + K(y) - K(xy)}{K(xy)}$$

since K(x|y) is approximately K(xy) - K(y). The algorithmic entropy K(x) is approximately equal to the size of x when compressed by a suitable compression algorithm.

Hence, the equation resolves to:

$$d(x,y) = 1 - \frac{|f(x)| + |f(y)| - |f(xy)|}{|f(xy)|}$$

where |f(x)| and |f(y)| are the sizes of the compressed strings x and y respectively, after compression by algorithm f, and |f(xy)| is the size of the concatenation of x and y after compression by algorithm f.

When y = x, d(x,y) → 0.

When y ≠ x, d(x,y) → 1.

The steps involved in the processing of the algorithm are as follows:

1) Extract the strings x and y.

2) According to the nature and length of strings x and y, choose a suitable compression algorithm. The effectiveness of the compression algorithm has been found to depend on the length of the strings under consideration. Suitable compression algorithms include Zlib, LZO, LZW etc. The skilled man will recognise these algorithms as examples of algorithms which perform variable length dictionary encoding using a sliding window compression buffer, although other classes of algorithms may be used. Depending on the lengths of the strings under consideration, the size of the compression buffer window may also have an impact on the effectiveness of the compression algorithm.

3) The strings x and y individually, and x and y concatenated, are compressed by the chosen compression algorithm, and the lengths of these compressed strings are measured.

4) The shared information distance between the two strings is calculated according to the equation above. It is important to note that this algorithm actually calculates the distance between information. That is to say, identical strings, which have a large amount of shared information between them, have very little distance between the information they hold, i.e. the calculated value tends towards zero. When two strings are substantially different, they have a low amount of shared information; therefore the distance between the information they hold, i.e. the calculated value, will tend towards 1.

5) If the calculated value is below a predefined level, the strings, x and y, can be declared to be substantially similar.
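Steps 1) to 5) can be sketched in Python using the standard `zlib` module as the compression algorithm f. The sketch assumes the distance takes the form 1 - (|f(x)| + |f(y)| - |f(xy)|) / |f(xy)|, i.e. the general equation with K(x|y) approximated by K(xy) - K(y); that reading, the function names and the default threshold of 0.48 (borrowed from the email address example later in this description) are assumptions. Because of differences in compressor and settings, the values it yields are not expected to reproduce the figures in the worked examples.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after zlib compression."""
    return len(zlib.compress(data))


def shared_information_distance(x: str, y: str) -> float:
    """Approximate distance: small for similar strings, near 1 for unrelated."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    size_x = compressed_size(bx)
    size_y = compressed_size(by)
    size_xy = compressed_size(bx + by)  # compressed concatenation
    return 1.0 - (size_x + size_y - size_xy) / size_xy


def substantially_similar(x: str, y: str, threshold: float = 0.48) -> bool:
    """Step 5: compare the distance with a predefined level (assumed value)."""
    return shared_information_distance(x, y) <= threshold
```

For very short strings the fixed overhead of the compressed format is a noticeable fraction of the total, which is one reason the choice of algorithm in step 2 matters.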

Worked Examples

1. Extracting Metadata From Email.

The following is an example of a 419-type scam email.

Return-path: <adammboma3@tiscali.de>
Date: Thu, 24 Feb 2005 13:17:15 +0000
From: adammboma3@tiscali.de
subject: NEXT OF KIN IS NEEDED
To: adammboma3@tiscali.de

MR. ADAM MBOMA DIRECTOR, AUDIT AND ACCOUNTING DEPARTMENT, AFRICAN DEVELOPMENT BANK.

LOME-TOGO WEST AFRICA

TELL:+228-9116578.

Dear friend, I know that this letter may come to you as a surprise. I got your contact address from the Internet while I was searching for a business partner.

My name is MR. ADAM MBOMA. The Audit and Accounting Director of the African Development Bank, Lome-Togo. In my department I discovered an abandoned sum of USD$12,500.000.00

TWELVE MILLION FIVE HUNDRED THOUSAND UNITED

STATES DOLLARS. In an account that belongs to this our foreign customer who died along with his wife and children in the plane crash.

This client was among the victims of EGYPT AIR BOEING 767 FLIGHT NO.990 that crashed on the 31-10-1999 in U.S.A. Since we got information about his death, we have been expecting his Next of Kin to come over and claim his money because we cannot release it unless somebody applies for it as the Next of kin or relation to the deceased, as indicated in our banking guidelines but unfortunately I learnt that all his supposed next of kin or relation died alongside with him in the plane crash leaving nobody behind for the claim.

It is therefore upon this discovery that I decided to make this business proposals to you so that you will apply as the NEXT OF KIN or relation to the deceased for safety and subsequent disbursement since nobody is coming for it because we don't want this money to go into the bank treasury as unclaimed Bill. The banking law and guidelines here stipulates that if such money remains unclaimed after Six years, the money will be transferred in to the bank treasury as unclaimed fund. The request for foreigner in this transaction is necessary because our late customer was a foreigner and a Togolaise cannot stand as next of kin to a foreigner. I agree that 40% of this money will be for you as foreigner partner in respect to the provision of a foreign account. 2% will be set aside for expenses incurred during the business and 58% would be for me and after which I shall visit your country for disbursement according to the percentages indicated.

Therefore to enable the immediate transfer of this fund to you as we arranged, you will furnished me with a good receiving BANK ACCOUNT NUMBER, BANK NAME & ADDRESS where the money will be transferred into, your private telephone or mobile phone and fax number for easy communication. Upon receipt of your reply, I will send to you by fax or e-mail a text of the application which you shall Fill your banking details and fax to our foreign remittance manager, for easy execution of the transaction. I will not fail to bring to your notice that this transaction is 100% risk-free on both side. As all required arrangement have been made for the transfer and more so all the documents backing this claim will be supplied to you after you might have applied. Please I would like you keep this transaction confidential and as a top secret.

Trusting to hear from you through this mail address and my mobile phone number +228-9116578 immediately.

Yours Sincerely, MR. ADAM MBOMA.

From this email, the system extracts the following strings as metadata:

Email address: adammboma3@tiscali.de
Numerical strings: 228-9116578
Capitalised strings:
MR. ADAM MBOMA DIRECTOR, AUDIT AND ACCOUNTING DEPARTMENT, AFRICAN DEVELOPMENT BANK.
LOME-TOGO WEST AFRICA
TWELVE MILLION FIVE HUNDRED THOUSAND
Subject: NEXT OF KIN IS NEEDED

2. Calculation of Shared Information.

Consider the following email addresses:

1) swisslotto@mail2world.com
2) swiss.lotto1@virgilio.it
3) swisscyberlottery@virgilio.it
4) swisslotto@netscape.net
5) switzerlandlotto@netscape.net
6) swissclaims@netscape.com
7) swisslottowins@netscape.net
8) lotteryinterpro4@netscape.net
9) switzerlotto@netscape.com
10) univertrustagenc@netscape.net

Calculation of the degree of shared information between each pair of the email addresses gives the following matrix:

        1      2      3      4      5      6      7      8      9     10
 1  0.091
 2  0.576  0.094
 3  0.649  0.378  0.081
 4  0.515  0.594  0.676  0.103
 5  0.657  0.686  0.703  0.343  0.086
 6  0.606  0.688  0.703  0.406  0.514  0.094
 7  0.545  0.606  0.676  0.242  0.400  0.424  0.091
 8  0.706  0.706  0.649  0.441  0.457  0.529  0.441  0.088
 9  0.545  0.667  0.676  0.333  0.257  0.424  0.394  0.471  0.091
10  0.771  0.771  0.730  0.514  0.514  0.543  0.514  0.543  0.514  0.114

It can be seen that when two strings are equal, the calculated value is ≤ 0.114. When two strings are unrelated, as with address 1 and address 10, the calculated value is 0.771.

From the data, a suitable rule for deciding whether two addresses are related would be a calculated value ≤ 0.48, with two addresses taken to be identical if the calculated value is ≤ 0.114. Applying this produces the following matrix:

    1  2  3  4  5  6  7  8  9 10
 1  *
 2  .  *
 3  .  +  *
 4  .  .  .  *
 5  .  .  .  +  *
 6  .  .  .  +  .  *
 7  .  .  .  +  +  +  *
 8  .  .  .  +  +  .  +  *
 9  .  .  .  +  +  +  +  +  *
10  .  .  .  .  .  .  .  .  .  *

where * = identical strings, + = related strings, . = unrelated strings.

From this matrix it can be seen that there are two distinct clusters. Email addresses 2 and 3 are related, as are addresses 4, 5, 6, 7, 8 and 9. Within the second cluster, address 6 is more distantly related to the other members of the cluster. Choosing a lower score at which to decide whether two strings are related would have excluded this address from the cluster.
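The grouping read off the matrix can be reproduced with a small Python sketch. It assumes that clusters are formed by linking every pair whose value is at or below the 0.48 level and taking the connected groups; the specification does not state how groups are formed from pairs, so this linking rule and the function name are assumptions. The values are those of the distance matrix, with the entries printed as "0606" and "0514" read as 0.606 and 0.514.

```python
from typing import List

# Lower triangle of the distance matrix: row i holds d(i, 1) .. d(i, i).
LOWER = [
    [0.091],
    [0.576, 0.094],
    [0.649, 0.378, 0.081],
    [0.515, 0.594, 0.676, 0.103],
    [0.657, 0.686, 0.703, 0.343, 0.086],
    [0.606, 0.688, 0.703, 0.406, 0.514, 0.094],
    [0.545, 0.606, 0.676, 0.242, 0.400, 0.424, 0.091],
    [0.706, 0.706, 0.649, 0.441, 0.457, 0.529, 0.441, 0.088],
    [0.545, 0.667, 0.676, 0.333, 0.257, 0.424, 0.394, 0.471, 0.091],
    [0.771, 0.771, 0.730, 0.514, 0.514, 0.543, 0.514, 0.543, 0.514, 0.114],
]


def group_related(lower: List[List[float]], threshold: float) -> List[List[int]]:
    """Group address numbers (1-based) linked by distances <= threshold."""
    n = len(lower)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i):  # strictly below the diagonal
            if lower[i][j] <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i + 1)
    return sorted(groups.values())


clusters = group_related(LOWER, 0.48)  # [[1], [2, 3], [4, 5, 6, 7, 8, 9], [10]]
```

Addresses 1 and 10 remain on their own, which matches the statement that only two clusters are present.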

3. Calculation of Most Representative Value.

For the above cluster of email addresses, 4 to 9, to calculate the most representative email address amongst these, we sum the total values of the calculated shared information distance.

Email address no.   Sum of shared information metric
4                   1.868
5                   1.857
6                   2.391
7                   1.992
8                   2.173
9                   1.916

Table of email address numbers and the sum of the calculated shared information distances between each email address and the email addresses in the cluster.
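The selection used in this example, summing each member's distances to the members of the cluster and taking the member with the lowest total, can be sketched in Python. The function name `most_representative` is hypothetical, and `table_totals` simply restates the totals listed in the table rather than recomputing them.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def most_representative(items: Sequence[T], distance: Callable[[T, T], float]) -> T:
    """Return the item whose summed distance to all items is smallest."""
    totals = [sum(distance(a, b) for b in items) for a in items]
    best_index = min(range(len(items)), key=totals.__getitem__)
    return items[best_index]


# Totals as listed in the table (address number -> summed distance).
table_totals = {4: 1.868, 5: 1.857, 6: 2.391, 7: 1.992, 8: 2.173, 9: 1.916}
lowest = min(table_totals, key=table_totals.get)  # address number 5
```

Any pairwise distance function can be supplied, for example a compression based distance between two strings.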

Since a low value from the described algorithm is related to a high amount of shared information, the email address with the lowest total sum of equation values is the most representative of the set. In this case email address 5 has the lowest sum value, and is therefore the most representative of the set.

4. Example of Export and Detection.

Cluster 1 consists of two emails with the following identified metadata:

Email 1
Email address: adamm2@yahoo.com
Numerical strings: 228-9116578
subject: PLEASE CONTACT ME

Email 2
Email address: adammboma3@tiscali.de
Numerical strings: 228-9116578
subject: NEXT OF KIN IS NEEDED

For brevity, capitalised strings and message bodies have been omitted.

Cluster 2 consists of two emails with the following metadata:

Email 3
Email address: swisslotto@netscape.net notification@swissworldlotto.net
Numerical strings: 42-20-17-11-35
subject: Lottery Winner

Email 4
Email address: switzerlandlotto@netscape.net bennardshaw@yahoo.co.au
Numerical strings: 43-55-36-23-44
subject: CONGRATULATIONS!!!!

Cluster 1 contains a suitable signature string, '228-9116578', found in all the emails in the cluster. This string is passed to the signature filter 103. Therefore no metadata for this cluster is passed to the classifier.

Cluster 2 does not contain a signature string, so the most representative metadata from this cluster is passed to the classifier 104.

This metadata consists of:

Email address: swisslotto@netscape.net
Numerical strings: 42-20-17-11-35
subject: Lottery Winner

Now, consider an unknown email received by the detection sub-system.

The metadata of this email is:

Email address: swisslotto@muchomail.com swissclaims@netscape.com
Numerical strings: 7-14-17-23-31-44
subject: Lotto Winning

This email is passed through the system.

The heuristic engine 102 examines the email and finds the presence of the strings 'winning', 'congratulation', 'lotto'. The email is not rejected by the system.

The signature filter 103 examines the email and does not find the string '228-9116578'.

The classifier 104 examines the email.

The email address 'swisslotto@muchomail.com' is calculated to have a distance of 0.5 from the metadata email address representative of cluster 2. This is slightly greater than our criterion for similarity of a score less than or equal to 0.48.

The email address 'swissclaims@netscape.com' is calculated to have a distance of 0.4375 from the cluster 2 metadata email address. This is within our criterion for being substantially similar, therefore the cluster 2 counter is incremented to record the similarity in metadata.

The subject header 'Lotto Winning' is calculated to have a distance of 0.409 from the cluster 2 metadata subject header. This value is within our criterion for being substantially similar. The cluster 2 counter is incremented again.

The number string is calculated to have a distance of 0.68 from the cluster 2 metadata number string, and is therefore not substantially similar.

The email is found to have two items of metadata substantially similar to those of cluster 2. The email is judged to be similar enough to be a member of the cluster. The email is therefore declared spam, and a signal is sent to the spam output 203.
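The counting performed by the classifier in this example can be restated as a short Python sketch. The distances are those quoted above rather than recomputed; the function name, the optional per-type weights and the counter level of 2 are assumptions, the last chosen only because two matching items were enough in this example.

```python
from typing import Dict, List, Optional, Tuple


def count_similar(
    distances: List[Tuple[str, float]],
    threshold: float,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Add up increments for metadata items within the similarity threshold."""
    weights = weights or {}
    counter = 0.0
    for metadata_type, distance in distances:
        if distance <= threshold:
            # The increment may be weighted by metadata type (default 1).
            counter += weights.get(metadata_type, 1.0)
    return counter


# Distances to the cluster 2 metadata, as quoted in the example.
example = [
    ("email address", 0.5),
    ("email address", 0.4375),
    ("subject", 0.409),
    ("number string", 0.68),
]
counter = count_similar(example, threshold=0.48)
ASSUMED_LEVEL = 2  # hypothetical predefined level for declaring spam
is_spam = counter >= ASSUMED_LEVEL
```

Passing a `weights` mapping such as `{"subject": 2.0}` would correspond to giving more importance to a particular type of metadata.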

5. Determination of Signatures

Further insight into the operation of the illustrated system can be gained by considering that we have 5 emails in a cluster:

1) "LOTS OF CASH NOW, phone 555-1112, further text explanation"
2) "LOTS OF CASH NOW, phone 555-1112, further text explanation"
3) "LOTS OF CASH NOW, phone 555-1112, further text explanations"
4) "LOTS OF CASH NOW, phone 555-1112, further text exclamation"
5) "LOTS OF DOSH NOW, phone 555-1112, further text exclamations"

If we set our predefined proportion to be 100%, we have only one signature to represent the cluster, the phone number based '555-1112'.

If we set our predefined proportion to be 80%, we have two signatures, a capitalised text based 'LOTS OF CASH NOW', and a phone number based '555-1112'.

If we place these two signatures in the signature filter 103 of the detection system, then we will catch all potential members of this cluster.

In the absence of the phone number signature, if we set our predefined proportion to be 100%, we do not have any strings which qualify as signatures, and we will rely on the entropy detection to detect potential members of this cluster. Setting the predefined proportion to 80% will result in the capitalised text signature, 'LOTS OF CASH NOW', being placed in the signature filter 103 of the detection system. In this case, 80% of members of the cluster will be detected and the others will pass through, since the presence of a signature for this cluster is taken as denoting there is no need for the computationally intensive entropy detection method.

The level of the 'predefined proportion' is a trade-off between speed of detection and accuracy, signature based methods being quicker, entropy based methods being slower but more accurate.
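The effect of the predefined proportion on this five-email cluster can be illustrated with a Python sketch. The function name is hypothetical, the phone number is written without stray spaces, and a string is treated as qualifying when it occurs in at least the given proportion of emails; this "at least" reading is an assumption made so that 4 out of 5 emails satisfy the 80% setting, as the example implies.

```python
from collections import Counter
from typing import List, Sequence, Set


def find_signatures(strings_per_email: Sequence[Set[str]], proportion: float) -> List[str]:
    """Return strings present in at least `proportion` of the emails."""
    total = len(strings_per_email)
    counts = Counter(s for strings in strings_per_email for s in strings)
    return sorted(s for s, n in counts.items() if n / total >= proportion)


# Capitalised string and phone number extracted from each of the 5 emails.
cluster = [
    {"LOTS OF CASH NOW", "555-1112"},
    {"LOTS OF CASH NOW", "555-1112"},
    {"LOTS OF CASH NOW", "555-1112"},
    {"LOTS OF CASH NOW", "555-1112"},
    {"LOTS OF DOSH NOW", "555-1112"},
]

at_100 = find_signatures(cluster, 1.0)  # ['555-1112']
at_80 = find_signatures(cluster, 0.8)   # ['555-1112', 'LOTS OF CASH NOW']
```

Lowering the proportion admits more signatures and so favours speed, at the cost of letting some cluster members pass undetected.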

However this description does not take into consideration the larger picture, especially what is being received in the training sub-system 100.

What tends to happen is that the spammers, change their emails over time, this can be viewed as analagous to genetic evolution. The training subsystem is likely to encounter something similar to the following emails: 1) "LOTS OF CASH NOW, phone 555-1112, further text explanation" 2) "LOTS OF CASH NOW, phone 555-1113, further text explanation" 3) "LOTS OF CASH NOW, phone 555-1 114, further text explanations" 4) "LOTS OF CASH NOW, phone 555-1115, further text exclamation" 5) "LOTS OF DOSH NOW, phone 555-1116, further text exclamations" 6) "LOTS OF CASH NOW, phone 555-1117, further text exclamations" 7) "LOTS OF DOSH NOW, phone 555-1 118, further text exclaimed" 8) "LOTS OF DOSH NOW, phone 555-1 119, further text exclaimed" 9) "LOTS OF DOSH NOW, phone 555-1110, further text exclaiming" 10) "LOTS OF MONEY NOW, phone 555-1120, further text here" If we set our predefined level for declaring a signature, to be "100% occurrence when the cluster contains more than 3 emails", our training system will proceed like this Receive email 1, create new cluster, A, no signatures.

Receive email 2, similar to cluster A according to entropy, no signatures.

Receive email 3, similar to cluster A according to entropy. Capitalised string "LOTS OF CASH NOW" meets criteria for a signature, create signature for cluster A.

Receive email 4, placed in cluster A by presence of a signature.

Receive email 5, does not contain signature for cluster A, create new cluster B.

Receive email 6, placed in cluster A by presence of a signature.

Receive email 7, does not contain signature for cluster A, similar to cluster B according to entropy.

Receive email 8, does not contain signature for cluster A, similar to cluster B according to entropy. Capitalised string "LOTS OF DOSH NOW" meets criteria for a signature, create signature for cluster B.

Receive email 9, placed in cluster B by presence of a signature.

Receive email 10, does not contain signatures for cluster A or B, create new cluster, etc.

Although the emails are clearly related, they tend to diverge over time. In this case, rather than placing all the emails together in a single cluster, we are directing the system to create new clusters for slightly different emails.
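The sequence of training steps above can be reproduced with a short sketch, offered under several assumptions. The function `similar` uses `difflib.SequenceMatcher` purely as a stand-in for the entropy based comparison and is not the measure used by the classifier 104. The regular expression for capitalised strings, the dictionary layout of a cluster and the parameter `min_size` are likewise hypothetical. A signature is declared once a cluster holds three members, which is the point at which the sequence above declares one, and a cluster that has a signature is matched by that signature only.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern for a "capitalised string": two or more upper-case words.
CAPS_RUN = re.compile(r"[A-Z]{2,}(?: [A-Z]{2,})+")

EMAILS = [
    "LOTS OF CASH NOW, phone 555-1112, further text explanation",
    "LOTS OF CASH NOW, phone 555-1113, further text explanation",
    "LOTS OF CASH NOW, phone 555-1114, further text explanations",
    "LOTS OF CASH NOW, phone 555-1115, further text exclamation",
    "LOTS OF DOSH NOW, phone 555-1116, further text exclamations",
    "LOTS OF CASH NOW, phone 555-1117, further text exclamations",
    "LOTS OF DOSH NOW, phone 555-1118, further text exclaimed",
    "LOTS OF DOSH NOW, phone 555-1119, further text exclaimed",
    "LOTS OF DOSH NOW, phone 555-1110, further text exclaiming",
    "LOTS OF MONEY NOW, phone 555-1120, further text here",
]


def similar(a, b, threshold=0.8):
    """Stand-in similarity test; NOT the entropy measure of the classifier."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def train(emails, min_size=3):
    clusters = []  # each cluster: {"members": [(number, text)], "signature": str or None}
    for number, text in enumerate(emails, start=1):
        # 1. Fast path: a stored signature decides membership on its own.
        target = next((c for c in clusters
                       if c["signature"] and c["signature"] in text), None)
        # 2. Slow path: compare only against clusters that have no signature yet.
        if target is None:
            target = next((c for c in clusters
                           if c["signature"] is None
                           and any(similar(text, m) for _, m in c["members"])), None)
        # 3. Otherwise the email starts a new cluster.
        if target is None:
            target = {"members": [], "signature": None}
            clusters.append(target)
        target["members"].append((number, text))
        # Declare a signature when every member shares a capitalised string.
        if target["signature"] is None and len(target["members"]) >= min_size:
            common = set.intersection(
                *(set(CAPS_RUN.findall(m)) for _, m in target["members"]))
            if common:
                target["signature"] = sorted(common, key=lambda s: (-len(s), s))[0]
    return clusters


for label, c in zip("ABC", train(EMAILS)):
    print(label, [n for n, _ in c["members"]], c["signature"])
# A [1, 2, 3, 4, 6] LOTS OF CASH NOW
# B [5, 7, 8, 9] LOTS OF DOSH NOW
# C [10] None
```

In this run the stand-in comparison is only ever asked to accept near-identical emails, so the resulting clusters depend on the signature logic rather than on the particular similarity measure chosen.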

If we set our predefined criteria as "100% occurrence when the cluster contains more than 6 emails", then all the emails would be placed in the same cluster. The criteria for generating a signature would not be met, and the similarity of emails to this single cluster would be detected solely on the basis of entropy.

In brief, emails which are clearly related to an existing cluster but which do not contain that cluster's signature are let through the detection subsystem as unrelated to a known cluster. If such an email is encountered by the training sub-system, it is used to create a brand new cluster.

This mechanism enables the system to adapt to the evolution and adaptations the spammers make to their emails to keep them constantly changing.

It is important to note that the predefined level by which to define when a signature is declared for a cluster can be made to depend on the number of emails in the cluster.

The use of signatures, as by signature filter 103, as the sole criterion by which to flag an email as spam does, of course, result in false positives sometimes occurring. A signature is no more than a string of characters, the presence of which in an email is taken as an indication that the email is spam. In the example above, the string "228-9116578" is all that constitutes the signature. However, consider the possibility that this string may occur in an email unrelated to this spam. For a random 10 digit number, disregarding the "-", the probability of this is 10 to the power -10, i.e. 0.0000000001, although Benford's law suggests the observed probability would be slightly less than this figure. Nevertheless this is an extremely low false positive rate; signature based systems tend to have the lowest false positive rate of all classes of spam detection approaches.
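The arithmetic behind this estimate can be checked with a few lines, offered as a sketch only. It assumes that each unrelated email contains exactly one uniformly random number of the same length as the signature; the function names and the volume of one thousand million emails are hypothetical and chosen purely for illustration.

```python
from fractions import Fraction


def match_probability(digits):
    """Chance that one uniformly random number of `digits` digits equals the signature."""
    return Fraction(1, 10 ** digits)


def expected_false_positives(emails, digits):
    """Expected accidental matches, assuming one random number per email."""
    return emails * match_probability(digits)


print(float(match_probability(10)))                # 1e-10
print(float(expected_false_positives(10**9, 10)))  # 0.1
```

Under these assumptions, about one accidental match would be expected for every ten thousand million unrelated emails.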

As to the practical implementation and use of the illustrated system, a number of points should be noted.

Firstly, there may be a number of the Figure 2 detection subsystems running in parallel against different email streams and sharing the data of the signature filter 103 and exported metadata store 105. Further, for high volume processing, incoming emails may be dynamically routed for processing by whichever of the parallel detection subsystems is least busy at the time.

The detection subsystem(s) may be operated on a different computer, possibly on a different site, than the training subsystem of Figure 1.

Further, the detection subsystem of Figure 2 need not be the sole arbiter of whether an incoming email is classified and processed as spam. It may be part of a larger system which implements other spam-detection measures and in those circumstances its flagging of an email as spam may contribute to a "scoring" system in which scores from other anti-spam detection measures are also used in determining whether the email is spam, e.g. when the overall score exceeds a given value.
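One possible form of such a scoring arrangement is sketched below. The measure names, the individual scores and the threshold are all hypothetical; the description does not specify how scores are assigned or combined, and a simple sum is assumed here.

```python
def overall_score(scores):
    """Sum the scores contributed by each anti-spam measure (assumed combination rule)."""
    return sum(scores.values())


def is_spam(scores, threshold):
    """Classify as spam when the overall score exceeds the given value."""
    return overall_score(scores) > threshold


# Hypothetical contributions for one email; the first entry stands for the
# flag raised by the detection subsystem of Figure 2.
scores = {
    "shared_information_detector": 5,
    "sender_blocklist": 0,
    "header_checks": 2,
}

print(overall_score(scores))          # 7
print(is_spam(scores, threshold=6))   # True
print(is_spam(scores, threshold=7))   # False
```

Integer scores are used so that the comparison with the threshold is exact.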

It should be noted that although it is possible to use emails identified as spam by the detection system as input to the training system, in fact it is preferable not to do so as it might tend, inter alia, to reinforce any incorrect assumptions made by the system. Rather, it is preferred to use as input to the training system emails which have been independently determined to be spam, e.g. by (human) inspection.

Claims

1. A method of anti-spam filtering of emails and comprising the steps, executed by machine, of: a) selecting an email for processing; b) calculating the value of an indicator of the information shared between the email and reference data from a reference corpus of spam emails; and c) selectively flagging the email as spam depending on the value of the indicator calculated by step b).

2. A method according to claim 1 where the indicator calculated in step b) is the information distance between data selected from the email and data selected from the reference corpus.

3. A method according to claim 2 wherein the information distance is calculated by compressing each of i) the data selected from the email, ii) the data selected from the reference corpus and iii) the concatenation of i) and ii) or vice versa, and calculating as the indicator the difference in the compressed sizes of items i) and ii) divided by the compressed size of item iii).

4. A method according to any one of the preceding claims wherein the reference data is obtained from a store of data characterising each of a number of clusters of emails of the corpus, the members of the cluster having previously been established, on the basis of calculated information distance, to contain shared information, the characterising data comprising, for each cluster, the value of at least one string extracted from a member of the cluster and selected by predetermined criteria as being indicative of spam.

5. A method according to claim 4 wherein, in execution of the calculating step b) the degree of information shared is calculated in respect of at least one such string and a correspondingly selected string in the email of step a).

6. A method according to claim 4 or 5 wherein the reference corpus is subject to a training algorithm comprising the steps of: t1) selecting an email not in the corpus; t2) calculating the degree of information which the selected email shares with abstracted data of the clusters; and t3) if the degree of information which the selected email shares with one of the clusters exceeds a given value, adding it to that cluster.

7. A method according to claim 6 wherein the training algorithm comprises the step, when an email has been added to a cluster, of regenerating the reference data for that cluster.

8. A method according to claim 7 wherein the regeneration step comprises identifying, for each of a number of corresponding selected parts of the emails of the cluster, that selected part of one email which has the highest amount of information shared with the corresponding selected parts of the other emails of the cluster and storing, as the abstracted data for the cluster, the or each such identified part.

9. A method according to claim 6, 7 or 8 wherein, in execution of the training algorithm, if more than a predetermined number of members of a cluster are found to contain a string usable as a characteristic signature, that string is stored in a signature store.

10. A method according to claim 9 wherein, in execution of the training algorithm, the email of step t1) is examined to determine whether it contains a signature in the signature store and, if it does, steps t2) and t3) are omitted and the email is added to the cluster from which the signature was derived.

11. A method according to claim 9 or 10 wherein an email selected for processing by step a) is examined to see whether it contains a signature in the signature store and, if it does, steps b) and c) are omitted and the email is flagged as spam.

12. A method according to any one of the preceding claims and including the step d) of taking remedial action in respect of an email flagged as spam by step c).

13. A method according to claim 12 wherein the remedial action comprises not sending the email to its addressee(s), deleting it or moving it to a predetermined folder.

14. A system of anti-spam filtering of emails and comprising the means, executed by machine, of: a) means for selecting an email for processing; b) means for calculating the value of an indicator of the information shared between the email and reference data from a reference corpus of spam emails; and c) means for selectively flagging the email as spam depending on the value of the indicator calculated by means b).

15. A system according to claim 14 where the indicator calculated by means b) is the information distance between data selected from the email and data selected from the reference corpus.

16. A system according to claim 15 wherein the information distance is calculated by compressing each of i) the data selected from the email, ii) the data selected from the reference corpus and iii) the concatenation of i) and ii) or vice versa, and calculating as the indicator the difference in the compressed sizes of items i) and ii) divided by the compressed size of item iii).

17. A system according to any one of claims 14 to 16 wherein the reference data is obtained from a store of data characterising each of a number of clusters of emails of the corpus, the members of the cluster having previously been established, on the basis of calculated information distance, to contain shared information, the characterising data comprising, for each cluster, the value of at least one string extracted from a member of the cluster and selected by predetermined criteria as being indicative of spam.

18. A system according to claim 17 wherein, in execution of the calculating means b) the degree of information shared is calculated in respect of at least one such string and a correspondingly selected string in the email selected by means a).

19. A system according to claim 17 or 18 and including means for subjecting the reference corpus to a training algorithm and which comprises: t1) means for selecting an email not in the corpus; t2) means for calculating the degree of information which the selected email shares with abstracted data of the clusters; and t3) means which, if the degree of information which the selected email shares with one of the clusters exceeds a given value, adds it to that cluster.

20. A system according to claim 19 wherein the training algorithm comprises means, when an email has been added to a cluster, for regenerating the reference data for that cluster.

21. A system according to claim 20 wherein the regeneration means comprises means for identifying, for each of a number of corresponding selected parts of the emails of the cluster, that selected part of one email which has the highest amount of information shared with the corresponding selected parts of the other emails of the cluster and storing, as the abstracted data for the cluster, the or each such identified part.

22. A system according to claim 19, 20 or 21 wherein, in execution of the training algorithm, if more than a predetermined number of members of a cluster are found to contain a string usable as a characteristic signature, that string is stored in a signature store.

23. A system according to claim 22 wherein, in execution of the training algorithm, the email selected by means t1) is examined to determine whether it contains a signature in the signature store and, if it does, operation of means t2) and t3) is omitted and the email is added to the cluster from which the signature was derived.

24. A system according to claim 22 or 23 wherein an email selected for processing by means a) is examined to see whether it contains a signature in the signature store and, if it does, operation of means b) and c) is omitted and the email is flagged as spam.

25. A system according to any one of claims 14 to 24 and including means for taking remedial action in respect of an email flagged as spam by means c).

26. A system according to claim 25 wherein the remedial action comprises not sending the email to its addressee(s), deleting it or moving it to a predetermined folder.

27. A method of anti-spam filtering of emails substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.

28. A system for anti-spam filtering of emails substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.