NL2011730C2

NL2011730C2 - Email fuzzy hashing categorizing system.

Info

Publication number: NL2011730C2
Application number: NL2011730A
Authority: NL
Inventors: Andreas Jacobus Donselaar
Original assignee: Spamexperts B V
Priority date: 2013-11-05
Filing date: 2013-11-05
Publication date: 2014-10-14
Also published as: NL2011730A

Description

Email fuzzy hashing categorizing system

Field of the invention

The invention relates to a system for categorizing flows of messages, a method for categorizing flows of messages, and a computer program comprising software code portions which, when running on a data processing system, perform the method.

Background of the invention

Email continues to be a highly popular communication method. The SMTP email protocol dates back to 1982 and has not changed significantly since. The simplicity of the protocol and the low cost of sending and receiving email explain its popularity; however, they also make the protocol vulnerable to abuse. Email users receive a mix of different types of email, e.g. direct communication, spam, marketing, mass mail, commercial newsletters, social media, contracts, etcetera. There are numerous automatic ways of classifying email, each of them taking up a certain amount of computing power and resources to perform the classification.

Classification based on email metadata (such as the originating address of the sender) is resource-efficient and can be effective, but is limited in accuracy, especially with the increase in email sent over IPv6, and for email submitted from local users. Classification based on statistical analysis of the content of the message is accurate, but can be very resource-inefficient (i.e. require significant memory and CPU cycles) compared to other techniques, making it less feasible for classifying large volumes of email and for locating email in large datasets. In addition, the stored training data required for these techniques often includes private data (e.g. a credit card number could incidentally be captured). Classification based on applying a unique identifier (e.g. a hash) to a message is resource-efficient and private, but only effective when many exact copies are being identified, not the near-exact copies that are often sent (classification) or being searched for (location).

EP1956777 describes a method and system for reducing the proliferation of electronic messages. An electronic message or a portion thereof is transmitted by the server system. A spam notification signal may be received related to the electronic message or the portion thereof. Access to said electronic message is restricted solely in response to receiving the spam notification signal. The system seems to lean heavily on user input for identifying messages as spam.

Summary of the invention

A disadvantage of prior art is that with the increase in volume of spam messages, the costs in the sense of manpower, data storage and data processing also increase.

Hence, it is an aspect of the invention to provide an alternative method and system for categorizing messages, which preferably further at least partly obviates one or more of above-described drawbacks.

The invention thus provides a method for categorizing flows of messages, said method comprising receiving a flow of messages on a first server, splitting each of said messages into header information, message textual data, and message layout data, splitting said message textual data of each of said messages into parts having an equal part length, said part length depending on the language of said message textual data, calculating for each message a message fingerprint comprising a series of ridges, said ridges calculated from parameters resulting from the application of a fuzzy hashing algorithm, sending said message fingerprints to a second server comprising a database of fingerprints and message categories related to said fingerprints, looking up said fingerprints in said database using a fuzzy matching algorithm, said looking up providing a fuzzy match, determining a probability for said fuzzy match with respect to said fingerprints, labelling said message with said fuzzy match as its category if said probability indicates that said fuzzy match matches said fingerprint within a predetermined tolerance, and sending said message category for each message to said first server.

The invention further provides a system for categorizing flows of messages, said system comprising:
- a first server for receiving a flow of messages;
- a second server comprising a database comprising fingerprints and message categories related to said fingerprints;
- a fingerprinting device on said first server, said fingerprinting device adapted for splitting each of said messages into header information, message textual data, and message layout data, splitting said message textual data of each of said messages into parts having an equal part length, said part length depending on the language of said message textual data, and calculating for each message a message fingerprint comprising a series of ridges, said ridges calculated from parameters resulting from the application of a fuzzy hashing algorithm;
- a first transmission device, coupled to said fingerprinting device and provided on said first server for transmitting said fingerprint to said second server;
- a lookup device on said second server for receiving said fingerprint and looking said fingerprint up in said database, said lookup device adapted for generating a match probability and at least one message category;
- a second transmission device, coupled to said lookup device and provided on said second server for transmitting said match probability and said category to said first server.

The invention further provides a computer program comprising software code portions which, when running on a data processing system, perform the method of the invention. The invention further relates to a data carrier provided with that computer program, and a signal carrying at least part of said computer program.

The method allows a reduction in cost, in the sense of data storage and data processing capacity. Furthermore, most of the messages can be classified without intruding into the privacy of either sender or receiver of the messages.

Further particular embodiments are for instance described in the dependent claims.

The innovation describes a new method based on fuzzy hashing that allows classification, categorisation, and locating emails in large quantities accurately with minimal computing resources. The system continues to self-improve based on both automatic and manual feedback loops, which is an inexpensive way of growing the datasets and continuously improving the system. This invention forms the basis of the product lines of SpamExperts, as it allows SpamExperts to provide highly accurate email classification and archiving services whilst minimising both computing resources and human working hours to manually update and fine-tune such classification systems. It can be compared to taking a small fingerprint of an email, that is not only unique for that specific email but indicates similarity with other emails as well, in the same way that DNA can identify not only an exact individual match, but also close relationships (e.g. familial).

This invention allows messages, for instance emails, to be grouped accurately. Examples of possible categories are ‘communication’, ‘spam’, ‘marketing’, ‘mass mail’, ‘commercial’, ‘newsletter’, ‘social’, ‘contracts’. It further allows specific emails to be located in a large dataset. The invention is very sparing with computer resources, allowing its tasks to be done more efficiently than existing systems. This may save costs on computing power and energy, and reduces the environmental footprint of the service.

In particular for email, the invention proved very advantageous. Every email is unique and consists of a combination of header information, providing technical metadata, and a message body, providing the actual content of the email. Except for technical message formatting standards there are not many rules on how to build up the content of an email, and best practice is often violated. This makes it hard to categorize emails using an automated process. Although various statistical methods can be applied for message identification, they require a lot of computing resources for each and every message, as well as significant resources for storage (and generally transmission) of learned databases. Such statistical models can be effective; however, when processing billions of emails this becomes inefficient and expensive. In addition, these methods typically work by breaking a message into tokens (e.g. words) and storing maps of tokens to token counts; as such, personal and private data is often incidentally captured as tokens.

This invention describes an advanced method of fuzzy hashing to categorize and locate emails in an effective way. The hashing databases are built up using information from external feedback loops (such as automatic and manual reports), which generate the database to allow for fuzzy matching. The more messages are processed by the system, the more accurately it can classify messages without requiring additional expensive computing resources. The classical “expensive” checks, such as Bayesian, CRM114, and rule-based virus/spam scanning, make it possible to continuously improve the “fast fuzzy hashing checks” through auto-classification and training.

In general, a hash function is any algorithm that maps data of variable length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
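The fixed-length property can be illustrated with a minimal sketch. SHA-256 from Python's standard library is used here purely as a convenient example of a hash function; it is an assumption of this illustration and not the algorithm used by the invention.

```python
import hashlib


def fixed_length_hash(data: bytes) -> str:
    """Map input of any length to a 64-character hexadecimal digest."""
    return hashlib.sha256(data).hexdigest()


# Inputs of very different lengths yield hash values of the same length.
short_digest = fixed_length_hash(b"hi")
long_digest = fixed_length_hash(b"hi" * 10_000)
print(len(short_digest), len(long_digest))
```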

In particular, fuzzy hashing can be used to match data or data elements, like strings, that have similarities, such as two sets of data with sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length, for instance hash values. For instance, data elements or combinations of data elements that are similar within a predetermined distance result in the same parameters. These parameters are in turn used for calculating the ridges.

The current method uses fuzzy matching. In particular, “fuzzy match” is also referred to as “approximate matching”. When referring to strings, it may be referred to as “approximate string matching”. In this respect, it may be defined as finding a string that lies within a predefined distance of a target or pattern string. One possible definition of the approximate string matching problem is the following: given a pattern string $P = p_1 p_2 \ldots p_m$ and a text string $T = t_1 t_2 \ldots t_n$, find a substring $T_{j',j} = t_{j'} \ldots t_j$ in $T$ which, of all substrings of $T$, has the smallest edit distance to the pattern $P$.
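The substring formulation of the problem can be sketched with a standard dynamic programme in which a match is allowed to start at any position of the text. The function name and the return convention are assumptions made for this illustration; the description does not prescribe a particular algorithm.

```python
def best_substring_distance(pattern: str, text: str) -> tuple[int, int]:
    """Return (distance, end) for the substring of `text` closest to `pattern`.

    `distance` is the smallest edit distance between `pattern` and any
    substring of `text`; `end` is the exclusive end index of the first
    substring that reaches it.
    """
    m = len(pattern)
    # Column for the empty text prefix: pattern[:i] needs i deletions.
    prev = list(range(m + 1))
    best = (prev[m], 0)
    for j, ch in enumerate(text, start=1):
        # cur[0] == 0 because a match may start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in the text
                         cur[i - 1] + 1)      # pattern character missing from the text
        if cur[m] < best[0]:
            best = (cur[m], j)
        prev = cur
    return best


print(best_substring_distance("coat", "xxcoatxx"))
```

The table is filled column by column, so memory use is proportional to the pattern length rather than to the text length.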

For instance, the closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:
insertion: cot → coat
deletion: coat → cot
substitution: coat → cost

These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:
insertion: co*t → coat
deletion: coat → co*t
substitution: coat → cost

Some approximate matchers also treat transposition, in which the positions of two letters in the string are swapped, to be a primitive operation. Changing cost to cots is an example of a transposition.
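A matcher that counts a transposition as a single operation can be sketched with the optimal string alignment variant of the edit distance. This is one common way to handle transpositions and is shown as an illustration only; the function name is an assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance in which swapping two adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("cost", "cots"))
```

Without the transposition rule, the same pair would need two substitutions.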

Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is “coil”, “foil” differs by one substitution, “coils” by one insertion, “oil” by one deletion, and “foal” by two substitutions. If all operations count as a single unit of cost and the limit is set to one, “foil”, “coils”, and “oil” will count as matches while “foal” will not.
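The single global unweighted cost described above corresponds to the classic Levenshtein distance. The following sketch reproduces the “coil” example; the function names and the `limit` parameter are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_match(pattern: str, candidate: str, limit: int = 1) -> bool:
    """Accept a candidate whose total cost does not exceed the limit."""
    return edit_distance(pattern, candidate) <= limit


for word in ("foil", "coils", "oil", "foal"):
    print(word, edit_distance("coil", word), is_match("coil", word))
```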

Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers permit separate assignments of limits and weights to individual groups in the pattern.

The UDP protocol is used for the communication (from the server attempting to locate or classify the message, to the server that holds a central store of known fingerprint data) of both the feedback and the fuzzy hashing check to further minimize the resource footprint. In general UDP, User Datagram Protocol, uses a simple transmission model with a minimum of protocol mechanism. UDP is a network protocol used for the Internet, described in RFC 768. With UDP, computer applications can send messages, in this case referred to as datagrams, to other hosts on an Internet Protocol (IP) network without prior communications to set up special transmission channels or data paths. UDP may provide checksums for data integrity, and port numbers for addressing different functions at the source and destination of the datagram.
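Sending a fingerprint as a single datagram could look roughly as follows. The wire format (a 16-bit count followed by 32-bit unsigned integers in network byte order) and all function names are assumptions made for this sketch; the description does not specify the actual datagram layout.

```python
import socket
import struct


def pack_fingerprint(ridges: list[int]) -> bytes:
    """Encode ridges as a 16-bit count followed by 32-bit unsigned integers."""
    return struct.pack(f"!H{len(ridges)}I", len(ridges), *ridges)


def unpack_fingerprint(payload: bytes) -> list[int]:
    """Decode a payload produced by pack_fingerprint."""
    (count,) = struct.unpack_from("!H", payload, 0)
    return list(struct.unpack_from(f"!{count}I", payload, 2))


def send_fingerprint(ridges: list[int], host: str, port: int) -> None:
    """Send the fingerprint in one datagram, without any connection set-up."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(pack_fingerprint(ridges), (host, port))


payload = pack_fingerprint([1, 4294967295, 123456])
print(len(payload), unpack_fingerprint(payload))
```

A fingerprint of a few dozen ridges fits comfortably in one datagram, which is consistent with the small resource footprint the description aims for. A call such as `send_fingerprint(ridges, "127.0.0.1", 9999)` would use a hypothetical lookup server address.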

The fuzzy hashing system in the end simply adds a highly effective first layer of classification; if it is unsure, a more resource-expensive classification, as outlined above, is used to classify the email, and the classification is reported back to the fuzzy hashing engine so that future similar emails can be identified more efficiently.

A fuzzy hashing algorithm creates integer values, referred to as ridges, which are generated based on a mixture of normalised headers and normalised textual contents of an email.

The stream of normalised textual content of the message is broken into segments of a length determined by the primary language of that segment of the message. The language is either specified in the message metadata, or algorithmically determined. The length is determined by the primary language of that segment of the message in order to compensate for variations in typical word length between languages (especially between Germanic and Asian languages). Some textual content in a message is not intended for human consumption (e.g. mark-up information and style information). These segments in a message are treated as separate languages in this respect.
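The language-dependent segmentation can be sketched as follows. The segment lengths, the language codes and the normalisation rule (lower-casing and collapsing whitespace) are hypothetical values chosen for this illustration; the description does not disclose the actual figures.

```python
# Hypothetical segment lengths per language; not taken from the description.
SEGMENT_LENGTH = {"en": 20, "zh": 5}
DEFAULT_SEGMENT_LENGTH = 12


def normalise(text: str) -> str:
    """Assumed normalisation: lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def split_into_segments(text: str, language: str) -> list[str]:
    """Split normalised text into parts of equal, language-dependent length."""
    length = SEGMENT_LENGTH.get(language, DEFAULT_SEGMENT_LENGTH)
    normalised = normalise(text)
    # An incomplete trailing part is dropped so that all parts are equal in length.
    return [normalised[i:i + length]
            for i in range(0, len(normalised) - length + 1, length)]


print(split_into_segments("Hello   World, this is a test message!", "en"))
```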

The segments are shuffled with a deterministic sorting algorithm. If the same segments were processed twice, they would end up in the same order, but this order is not related in any way to the order in which the segments appear in the message. Examples of deterministic sorting algorithms are quicksort, heapsort, mergesort and bubble sort. A number of the segments are selected for use in the message fingerprint, and the rest are discarded; the number is again based on the languages present (or apparently present) in the message, and the overall length of the message. Thus a message in English is likely to have fewer but longer segments than a message in simplified Chinese, which in comparison has more but shorter segments.
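A minimal sketch of the reordering and selection step is given below. Sorting lexicographically and keeping the first `keep` segments is an assumption made for illustration; the description only requires that the order be deterministic and unrelated to the order of appearance.

```python
def select_segments(segments: list[str], keep: int) -> list[str]:
    """Order segments deterministically and keep the first `keep` of them."""
    # sorted() depends only on the segment contents, not on their positions.
    return sorted(segments)[:keep]


first = select_segments(["dear custo", "mer, your ", "account ha"], keep=2)
second = select_segments(["account ha", "dear custo", "mer, your "], keep=2)
print(first == second)
```

Because the selection depends only on the content of the segments, two messages containing the same segments in a different order yield the same selection.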

Each segment is converted by a platform-specific hash algorithm to, for instance, a 32 bit unsigned integer, the ridge. In an embodiment, Python’s built-in hash may be used. Although this conversion increases the possibility of hash collision, especially on a 64 bit platform, the overall algorithm is designed such that collisions of a small number of ridges will not influence the final result. Using a 32 bit unsigned integer means that the storage and transmission requirements for a fingerprint (and a database of fingerprints) are extremely small (i.e. the data can easily fit completely within memory on even low-resource servers), and as the hash is essentially irreversible, no sensitive data is transmitted or stored. The ridges are combined to form the overall signature (“fingerprint”) for the message. The system currently sends each ridge as part of a basic C array structure, but previous versions have used Python sets, Python tuples, concatenation into a string, etc. In an embodiment, the ridges are examined as a group.
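Converting segments to 32 bit ridges can be sketched as follows. CRC-32 from the standard library stands in for the platform-specific hash here; this substitution is an assumption of the example, chosen because Python 3 salts its built-in string hash per process by default, which would make ridges differ between runs.

```python
import zlib


def ridge(segment: str) -> int:
    """Reduce a segment to a 32 bit unsigned integer."""
    return zlib.crc32(segment.encode("utf-8")) & 0xFFFFFFFF


def fingerprint(segments: list[str]) -> list[int]:
    """Combine the ridges of the selected segments into one fingerprint."""
    return [ridge(segment) for segment in segments]


print(fingerprint(["account ha", "dear custo"]))
```

Each ridge occupies four bytes, so a fingerprint of, say, 25 ridges takes 100 bytes regardless of the size of the original message.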

When classifying a message, or locating a message within a large dataset, a fingerprint is generated for the source message. The fingerprint is transferred to a central lookup location, which returns the number of ridges in the fingerprint that are recognised within various categories of message (or, when locating a message, returns the location of the subset of messages where a message of this type would be stored).
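The lookup step, which returns the number of recognised ridges per category, might be sketched like this. Holding the database as an in-memory mapping from category name to a set of known ridges is an assumption made for the example.

```python
def count_recognised_ridges(fingerprint: list[int],
                            database: dict[str, set[int]]) -> dict[str, int]:
    """Return, per category, how many ridges of the fingerprint are known."""
    ridges = set(fingerprint)
    return {category: len(ridges & known)
            for category, known in database.items()}


# Hypothetical database contents.
database = {"spam": {1, 2, 3}, "newsletter": {3, 4}}
print(count_recognised_ridges([2, 3, 9], database))
```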

Once a response has been received from the lookup location, the system making the query evaluates the “fit” of the fingerprint to each of the various categories of message and determines whether any category is close enough to decide that the message belongs to it. A category is suitably close when the number of recognised ridges (as outlined in the previous paragraph) relative to the total number of ridges within the message is within a user-configurable threshold. Alternatively, it may be decided that further, for instance more resource-expensive, classification is also required. As the result of the response in an embodiment is a non-binary value, the system in such an embodiment is easily tuned to respond to the amount of training the system has experienced for data of each type. For instance, both global and local databases may be queried, where a global database has had years of training, while a local database may have only been in use for a few days. The tuning may also depend on the sensitivity that is required. For instance, a user may find it acceptable for a promotional message to be classified as spam, but not acceptable for a personal message to be classified as commercial. In addition, a message’s fingerprint may indicate that the message belongs to multiple categories, for instance social commercial mail. The fingerprint may indicate that the message may belong to one of a smaller set of categories. For instance, the fingerprint may indicate that the archived message is most likely to be in location A, and, if not, likely to be in location B, but almost certainly not in locations C through Z.
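Evaluating the fit against user-configurable thresholds can be sketched as follows. The threshold values, the per-category override and the function name are assumptions made for illustration.

```python
def evaluate_fit(total_ridges: int,
                 recognised: dict[str, int],
                 thresholds: dict[str, float],
                 default_threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return (category, fit) pairs whose fit reaches the category threshold.

    The fit is the share of the message's ridges recognised in a category.
    An empty result signals that further classification is required.
    """
    if total_ridges <= 0:
        return []
    accepted = []
    for category, count in recognised.items():
        fit = count / total_ridges
        if fit >= thresholds.get(category, default_threshold):
            accepted.append((category, fit))
    # Best-fitting category first; several categories may be accepted.
    return sorted(accepted, key=lambda item: item[1], reverse=True)


print(evaluate_fit(10, {"spam": 9, "newsletter": 3}, {}))
```

A stricter threshold for a sensitive category, for example `{"personal": 0.95}`, expresses the asymmetry mentioned above: misfiling a promotional message may be tolerable, misfiling a personal one may not.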

It may occur that a match of a fingerprint has a low probability of matching. In an embodiment, if a match has a low probability after application of a further algorithm, then said message is transmitted to a human classifier and enters a manual feedback loop. After comparing the fingerprint to a number of groups (e.g. in the simplest case, to ham and spam), a human would only be involved if there was a low probability of the message being part of any group where membership was being considered. In fact, in a well-trained system, this only occurs in rare instances. In an embodiment, if the probability is below 5% after application of a further algorithm, a human may be involved for further classification.

In an embodiment, if a match has a low probability after matching the fingerprint, in particular a probability below 1%, then said message is transmitted to a human classifier and enters a manual feedback loop.
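The two escalation embodiments (below 1% after fingerprint matching, below 5% after a further algorithm) can be combined in a small decision function. Expressing the probabilities as fractions between 0 and 1, and the function and constant names, are assumptions of this sketch.

```python
from typing import Optional

FINGERPRINT_LIMIT = 0.01  # below 1% after matching the fingerprint
FURTHER_LIMIT = 0.05      # below 5% after applying a further algorithm


def needs_human_review(fingerprint_probability: float,
                       further_probability: Optional[float] = None) -> bool:
    """Decide whether a message should enter the manual feedback loop."""
    if fingerprint_probability < FINGERPRINT_LIMIT:
        return True
    if further_probability is not None and further_probability < FURTHER_LIMIT:
        return True
    return False


print(needs_human_review(0.005), needs_human_review(0.40, 0.03), needs_human_review(0.40, 0.60))
```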

Thanks to the increasing number of messages processed over time, the various algorithms used, the underlying database infrastructure, and the efficiency of the lookup and judging system can regularly be improved further.

The term “substantially” herein, like in “substantially consists”, will be understood by and clear to a person skilled in the art. The term “substantially” may also include embodiments with “entirely”, “completely”, “all”, etc. Hence, in embodiments the adjective substantially may also be removed. Where applicable, the term “substantially” may also relate to 90% or higher, such as 95% or higher, especially 99% or higher, even more especially 99.5% or higher, including 100%. The term “comprise” includes also embodiments wherein the term “comprises” means “consists of”.

Furthermore, the terms first, second, third and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

The devices or apparatus herein are amongst others described during operation. As will be clear to the person skilled in the art, the invention is not limited to methods of operation or devices in operation.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device or apparatus claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

The invention further applies to an apparatus or device comprising one or more of the characterising features described in the description and/or shown in the attached drawings. The invention further pertains to a method or process comprising one or more of the characterising features described in the description and/or shown in the attached drawings.

The various aspects discussed in this patent can be combined in order to provide additional advantages. Furthermore, some of the features can form the basis for one or more divisional applications.

Brief description of the drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying schematic drawings in which corresponding reference symbols indicate corresponding parts, and in which:

Figure 1 schematically depicts an embodiment of a system for categorizing messages.

The drawings are not necessarily to scale.

Description of preferred embodiments

Figure 1 schematically depicts a system according to an aspect of the invention. The system comprises a fuzzy hashing database 1. In the fuzzy hashing database 1 all the hashing information is stored in separate classification categories. The database 1 is continuously updated with emails converted to new fuzzy hashes by both automatic email classifiers 2 and manual email classification submissions 3. An email 4 is converted using the fuzzy hashing algorithm 5. The fuzzy hashing algorithm 5 takes elements from both the email header information and the email body. The result is then classified using the fuzzy hashing lookup mechanism 6. This fuzzy hashing lookup mechanism performs a lookup 7 in the database. A special system judges the response 8. Based on a comparison, the system is either confident about the classification 9, or it passes the email on to further, often more resource-intensive, classification systems 2. These classification systems 2 in an embodiment automatically feed their classification results back into the fuzzy hashing database 1. Besides such automated classification feedback, manual feedback 3 is also possible to correct any mistakes.
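The flow of Figure 1 (lookup, judgement, fallback to a more expensive classifier, and feedback into the database) can be sketched in a few lines. The class and function names, the in-memory storage and the threshold value are assumptions made for this illustration and do not describe the actual implementation.

```python
from typing import Callable


class FingerprintDatabase:
    """In-memory stand-in for the fuzzy hashing database (1)."""

    def __init__(self) -> None:
        self._ridges: dict[str, set[int]] = {}

    def learn(self, category: str, fingerprint: list[int]) -> None:
        """Feedback from automatic (2) or manual (3) classification."""
        self._ridges.setdefault(category, set()).update(fingerprint)

    def lookup(self, fingerprint: list[int]) -> dict[str, int]:
        """Lookup (7): recognised ridges per category."""
        ridges = set(fingerprint)
        return {c: len(ridges & known) for c, known in self._ridges.items()}


def classify(fingerprint: list[int],
             database: FingerprintDatabase,
             fallback: Callable[[list[int]], str],
             threshold: float = 0.8) -> str:
    """Judge the lookup response (8); fall back and feed back if unsure."""
    total = len(set(fingerprint))
    counts = database.lookup(fingerprint)
    if total and counts:
        category, hits = max(counts.items(), key=lambda item: item[1])
        if hits / total >= threshold:
            return category  # confident classification (9)
    category = fallback(fingerprint)  # resource-intensive classifier (2)
    database.learn(category, fingerprint)
    return category


database = FingerprintDatabase()
print(classify([1, 2, 3], database, fallback=lambda fp: "spam"))
print(classify([1, 2, 3], database, fallback=lambda fp: "unused"))
```

In the second call the fallback is never invoked: the first call has already fed the fingerprint back into the database, which is the self-improving behaviour the description refers to.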

It will also be clear that the above description and drawings are included to illustrate some embodiments of the invention, and not to limit the scope of protection. Starting from this disclosure, many more embodiments will be evident to a skilled person. These embodiments are within the scope of protection and the essence of this invention and are obvious combinations of prior art techniques and the disclosure of this patent.

Claims

1. A method for categorizing a stream of messages, the method comprising:
- receiving a stream of messages on a first server;
- splitting each of the messages into header information, message textual data, and message layout data;
- splitting the message textual data of each message into parts having an equal part length, the part length depending on the language of the message textual data;
- calculating for each message a message fingerprint comprising a series of ridges, each of the ridges calculated from parameters resulting from the application of a "fuzzy hashing" algorithm;
- sending the message fingerprints to a second server comprising a database of fingerprints and message categories related to the fingerprints;
- looking up the fingerprints in the database using a "fuzzy matching" algorithm, the looking up yielding a "fuzzy match";
- determining a probability for the "fuzzy match" with respect to the fingerprint;
- labelling the message with the "fuzzy match" as its category when the probability indicates that the "fuzzy match" matches the fingerprint within a predetermined tolerance; and
- sending the message category for each message to the first server.

2. The method of claim 1, wherein the series of ridges is calculated using integer values that result from the application of the "fuzzy hashing" algorithm.

3. Method according to one or more of the preceding claims, wherein, when the probability indicates that the "fuzzy match" does not match the fingerprint within the predetermined tolerance, further classification algorithms are applied which give corrective feedback to the "fuzzy hashing" database.

4. Method according to one or more of the preceding claims, wherein the fingerprints are communicated from the first server to the second server using the User Datagram Protocol (UDP).

5. Method according to one or more of the preceding claims, wherein the ridges comprise integer values.

6. Method according to one or more of the preceding claims, wherein the second server returns a number of recognized ridges.

7. Method according to one or more of the preceding claims, wherein, when a match has a low probability after applying the further algorithm, in particular below 5%, or when a match has a low probability after matching the fingerprint, in particular below 1%, the message is sent to a human classifier and enters a manual feedback loop.

8. A computer program comprising software code portions which, when executed on a data processing system, perform the method according to one or more of the preceding claims.

9. Data carrier provided with the computer program of claim 8.

10. Signal provided with at least a part of the computer program of claim 8.

11. A system for categorizing streams of messages, the system comprising:
- a first server for receiving a stream of messages;
- a second server comprising a database comprising fingerprints and message categories that are related to the fingerprints;
- a fingerprinting device on the first server, the fingerprinting device being adapted for splitting each of the messages into header information, message textual data, and message layout data, splitting the message textual data of each message into parts having an equal part length, the part length depending on the language of the message textual data, and calculating for each message a message fingerprint comprising a series of ridges, the ridges being calculated from parameters resulting from the application of a fuzzy hashing algorithm;
- a first transmission device, coupled to the fingerprinting device and provided on the first server for sending the fingerprint to the second server;
- a lookup device on the second server for receiving the fingerprint and looking up the fingerprint in the database, the lookup device being adapted to generate a match probability and at least one message category;
- a second transmission device, coupled to the lookup device and provided on the second server for sending the match probability and the category to the first server.