GB2445763A

GB2445763A - Metadata filtering

Info

Publication number: GB2445763A
Application number: GB0700926A
Authority: GB
Inventors: Neil Duxbury
Original assignee: Roke Manor Research Ltd
Current assignee: Roke Manor Research Ltd
Priority date: 2007-01-18
Filing date: 2007-01-18
Publication date: 2008-07-23
Also published as: GB0700926D0

Abstract

The invention comprises a method of validating and pre-filtering metadata items including compiling a set of characters present in the allowed metadata and checking whether one or more characters of the metadata to be filtered fall within the data set. In one embodiment, only a portion of the metadata item is checked for allowed characters. Where there is more than one metadata item, the character set comprises characters from both.

Description

Metadata prefiltering

This invention relates to a method of processing metadata. Metadata is the term used for data which represents data and includes email and web addresses.

The methodology can also be applied to a large number of other metadata types including: URI/URL identification; SIP (Session Initiation Protocol) URI identification; E.164 telephone number detection; tag detection in other data formats; IP addresses; port range, protocol and session identifier detection; XML data structures; XML objects; HTML structures and objects; detection of content types; and identification of content from packet payloads.

Email lookup against a dictionary of target addresses is important in SPAM filtering for rejecting mail from known SPAM agents. The process of email address lookup against a target database can prove to be a significant performance bottleneck. The principal reason for this performance bottleneck is the processing overhead associated with checking whether all email addresses extracted from a data sample are in a database of target email addresses.

Various techniques are known to extract email addresses. These use, for example, the character set defined by the standards for identifying an email address. However, for lookup only specific email addresses are required.

In reality the probability of obtaining a hit on the database with an arbitrary email address is < 1%. Consequently, 99% of the lookup effort is spent rejecting potential items. The main existing method for solving this problem is to extract all email addresses in the sample and look them up one by one. This has the drawback that around 99% of the addresses presented to the lookup phase are not in the dictionary.

Thus, the lookup algorithm spends most of its time rejecting potential matches. Such primitive methodology is shown in figure 1, which shows the methodology of having to search for just one domain name, "bird.com". Some of this effort can be saved by reducing the number of items that are presented to the database for comparison. This is achieved according to the invention by pre-filtering the extracted email addresses based on the contents of the target dictionary.

This application describes an email address pre-filtering method. Although the methodology is designed specifically for application to email addresses, the techniques employed can be used to identify many other structured forms of data, for example URLs.

The invention comprises a method of validating and pre-filtering metadata items including compiling a set of characters of the allowed metadata and checking whether one or more characters of the metadata fall within the data set. Only a portion of the metadata item may be checked for allowed characters.

Where there is more than one metadata item, said character set comprises characters from both.

Both the characters and sequence of said metadata item, or a portion thereof, may be checked against those of metadata items of interest.

Alternatively, a hash function may also be used on part or all of the metadata items.

Email address lookup according to the invention is normally a two stage process involving first pre-filtering followed, if necessary, by comparison of the extracted email address against a database containing target email addresses. The invention decreases the number of comparisons that must be made against this database by performing pre-filtering as part of the email address extraction phase. This application focuses on the various ways in which this pre-filtering may be achieved.
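The two stage process can be sketched as follows. This is a minimal illustration only, not the implementation of the invention: the function name `lookup_pipeline`, the example addresses and the use of a Python set as the target database are all assumptions made for the sketch. It counts how many candidates actually reach the lookup stage.

```python
def lookup_pipeline(candidates, target_db, prefilter):
    """Stage 1: cheap pre-filter. Stage 2: lookup against the target database.

    Returns the addresses found in the database and the number of
    lookups that were actually performed.
    """
    hits = []
    lookups = 0
    for address in candidates:
        if not prefilter(address):
            continue  # rejected early, never reaches the database
        lookups += 1
        if address in target_db:
            hits.append(address)
    return hits, lookups


# Hypothetical data for illustration.
target_db = {"spam@bird.com"}
candidates = ["alice@example.org", "spam@bird.com", "bob@hotmail.com"]

# Assumed pre-filter: only characters seen in the target database are allowed.
allowed = set("".join(target_db))
hits, lookups = lookup_pipeline(
    candidates, target_db, lambda a: all(ch in allowed for ch in a)
)

# Same pipeline with no pre-filtering, for comparison.
_, lookups_unfiltered = lookup_pipeline(candidates, target_db, lambda a: True)
```

In this small sample only one of the three candidates reaches the database when the pre-filter is enabled, whereas all three do without it.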

In this scenario the absence of the full spectrum of valid characters in the user and domain name parts means that fewer email addresses are identified by the extraction algorithm. However, by supporting the addresses defined in the dictionary we also ensure that those addresses we are interested in are successfully identified so they can be passed on to the lookup stage. This change reduces the number of items that are passed on to the full lookup phase and thus speeds up the overall pipeline of extraction and lookup.

Pre-filtering can be achieved in a number of ways including:

a) pre-filtering based on a restricted character set. This increases performance by utilising the reduced character set defined by the dictionary entries when compared to the set defined in the standards. The reduced set means that hitting an email address is less likely.

b) pre-filtering based on a restricted character set and using the structure of the items in the target dictionary. The addition of structure to the filter means that hitting an email address is less likely.

c) the use of a state machine for the structured filtering.

d) the use of hashing for the structured filtering.

e) the use of a tree structure for the structured filtering.

f) the combined lookup and extraction algorithm utilising a tree structure. This effectively does away with the lookup stage altogether and performs the extraction and lookup simultaneously.

Pre-filtering based on trees with skip vertices to reduce the memory overhead of the approach. This reduces the memory requirement of implementing the tree based approach whilst providing similar if not better statistics for email hit rate as the character and structure based approaches.

Example 1

In this simple embodiment of the invention the pre-filtering comprises building up a character set and testing characters of the target metadata against the character set. For example, if one had a list of web sites (domain names) and "hotmail.com" was of interest, then according to the method the character set would comprise {h, o, t, m, a, i, l, ., c, o, m}. Where the metadata are domain names, one has to take into consideration the dots (.). Thus, when building a search algorithm, whenever a character is encountered which isn't part of the set then it can be disregarded (or regarded, depending on the situation).

If, for example, there are two domain names of interest (hotmail.com, google.com) the character set may include the characters from both.

In general the approach described here restricts the set of characters that can appear in an email address to those that appear in the target dictionary. Any detection algorithm based on this restricted set will then provide enhanced performance, as the probability of finding an email address composed of the restricted set is less than the probability of finding an email address composed of the full set. Thus, for an arbitrary sample fewer instances are passed on to the full lookup stage, which enhances the overall performance. This methodology is useful when only looking for a small number of domains, e.g. hotmail.com, where it gives an increase in throughput in the extraction phase of ~20%.
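The character set approach of this example might be sketched as below. The helper names `build_charset` and `passes_charset` and the candidate list are hypothetical and chosen only for illustration; the target domains are the two named in the text.

```python
def build_charset(dictionary):
    """Collect every character that appears in any target entry."""
    allowed = set()
    for entry in dictionary:
        allowed.update(entry)
    return allowed


def passes_charset(candidate, allowed):
    """True if every character of the candidate is in the allowed set."""
    return all(ch in allowed for ch in candidate)


targets = ["hotmail.com", "google.com"]
allowed = build_charset(targets)

# Hypothetical candidates extracted from a data sample.
candidates = ["hotmail.com", "example.org", "gmail.com"]
survivors = [c for c in candidates if passes_charset(c, allowed)]
# "example.org" is rejected because "x" is not in the set.
# "gmail.com" survives even though it is not a target, so the
# full lookup stage is still needed to reject it.
```

The filter never rejects a dictionary entry, but it can pass items that are not in the dictionary; it only reduces the number of items presented to the lookup stage.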

Example 2

Figure 3 illustrates this methodology. In effect the resulting state machine says: find email addresses whose domain name is prefixed with the sequences in the set {bi, ba, li, la} and where the remaining suffix contains the characters in the set {rd.com}, i.e. the end of the email address is compressed into a reduced number of states and the prefix is expanded to give enhanced filtering. This approach effectively limits the amount of branching that can occur and simplifies the implementation of the approach.

This is a more specific example than the Example 1 methodology, in that the partial structure (sequence) as well as the character set is tested.
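A rough sketch of this prefix plus suffix character set check is given below. It is not the state machine of figure 3 itself; the function name `passes_structured` and the requirement that at least one character follows the prefix are assumptions made for the sketch.

```python
# Prefix sequences and suffix character set taken from the example text.
PREFIXES = {"bi", "ba", "li", "la"}
SUFFIX_CHARS = set("rd.com")


def passes_structured(domain):
    """Check the two-character prefix, then the character set of the rest."""
    prefix, rest = domain[:2], domain[2:]
    if prefix not in PREFIXES:
        return False
    if not rest:
        return False  # assumption: a bare prefix is not a domain name
    return all(ch in SUFFIX_CHARS for ch in rest)


results = {d: passes_structured(d)
           for d in ["bird.com", "lird.com", "hotmail.com", "bixd.com"]}
```

Here "bird.com" and "lird.com" both pass, "hotmail.com" fails on its prefix, and "bixd.com" fails because "x" is outside the suffix character set.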

Example 2b

In addition the state machine (process) of the last example may be truncated, i.e. just checking for a prefix of "la" or "bi".

Example 3, hash based pre-filtering

This may use a hash function for the whole or part of the domain name(s) of interest.

The hash function may be any suitable one. For example each character may be assigned a numerical value and some operation performed. Hashing in certain circumstances is an effective and efficient method of testing metadata. Hashing may also be done for only a portion of the domain name. For example, hash over the first 4 or more characters of the email address.

Partial hashing may be done in conjunction with checking the remainder of the metadata for either valid characters or for the correct sequence. Figure shows a prefix hash with tail end state machine.
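One possible reading of partial hashing combined with a character check on the remainder is sketched below. The hash function, the prefix length of 4, the two dictionary entries and all function names are assumptions for illustration; the text states only that any suitable hash function may be used.

```python
PREFIX_LEN = 4  # assumed: hash over the first 4 characters


def simple_hash(text):
    """Assign each character a numerical value and combine them."""
    value = 0
    for ch in text:
        value = (value * 31 + ord(ch)) % 65536
    return value


def build_filter(dictionary):
    """Precompute prefix hashes and the allowed tail characters."""
    prefix_hashes = {simple_hash(e[:PREFIX_LEN]) for e in dictionary}
    tail_chars = set()
    for entry in dictionary:
        tail_chars.update(entry[PREFIX_LEN:])
    return prefix_hashes, tail_chars


def passes_hash_filter(candidate, prefix_hashes, tail_chars):
    """Hash the prefix, then check the tail against the allowed characters."""
    if simple_hash(candidate[:PREFIX_LEN]) not in prefix_hashes:
        return False
    return all(ch in tail_chars for ch in candidate[PREFIX_LEN:])


# Hypothetical dictionary for illustration.
prefix_hashes, tail_chars = build_filter(["bird.com", "lard.com"])
```

With this dictionary, `passes_hash_filter("bird.com", prefix_hashes, tail_chars)` succeeds, while "bird.org" is rejected on its tail characters and "hotmail.com" is rejected by the prefix hash and the tail check alike.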

Example 4, tree based pre-filtering

The state machine based pre-filtering method described previously does not make full use of the underlying structure contained in the email addresses in the target dictionary. In particular, the mapping of several edges to a single vertex allows emails such as hin@lird.com to defeat the filter. This is because it contains "bi" and all characters are part of the allowed character set. This disadvantage can be addressed by adding additional vertices to the compressed or full form of the state machine. This is illustrated for the domain name state machine in figure 5. This approach completely obviates the need for the lookup stage, as the lookup and identification of an email address within the dictionary are done simultaneously. In this modification to the approach the lookup and extraction algorithms are merged so that the lookup is performed as an email address is extracted. Consequently, there is no longer any need for the lookup phase. Thus, this design should result in the highest performance.
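As an illustration of merging lookup into a tree walk, the sketch below uses a plain character tree (trie) built from nested dictionaries. It is not the state machine of figure 5; the names `build_trie`, `in_dictionary` and `END`, and the two dictionary entries, are assumptions made for the sketch.

```python
END = object()  # hypothetical marker for "a dictionary entry ends here"


def build_trie(dictionary):
    """Build one vertex per character so that no two entries share a
    vertex unless they share the whole prefix leading to it."""
    root = {}
    for entry in dictionary:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def in_dictionary(candidate, root):
    """Walk the tree character by character; reject on the first mismatch."""
    node = root
    for ch in candidate:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


# Hypothetical dictionary for illustration.
trie = build_trie(["bird.com", "lard.com"])
```

Because each branch keeps its own vertices, "lird.com" and "bard.com" are now rejected during the walk, so a candidate that reaches an end marker is known to be in the dictionary without a separate lookup.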

Example 4a

The method of 'path compression' can also be applied to the above approach.

Although this would still require a follow-on lookup stage, the advantage of this method is that it would greatly reduce the amount of memory required to represent the set of dictionary email addresses. A path compressed version of the state machine shown in figure 5 is illustrated in figure 6. Within the path compressed variant of the data structure a skip value is added to each internal vertex. The skip value allows the algorithm to move forward and check the only possible ending for the appropriate branch. A mismatch at this position leads to rejection whereas a match passes the filter.

This modification saves memory by removing the internal nodes and minimises the number of characters that need to be looked at to determine if the email address is worth looking up. This modification greatly reduces the number of comparisons that need to be made in the pre-filter and also significantly improves the pre-filtering for large dictionaries. Use of this method is expected to reject a large number of the candidate email addresses before they are looked up.
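A path compressed tree with skip values might look like the sketch below. It is one possible interpretation, not the structure of figure 6: the dictionary layout, the function names, the length check at the leaf, and the assumption that entries are distinct, non-empty and that none is a prefix of another are all choices made for this illustration.

```python
def build_skip_tree(entries, depth=0):
    """Keep only the positions where entries differ; record how many
    characters can be skipped before the next branching position."""
    if len(entries) == 1:
        entry = entries[0]
        # Leaf: remember only the length and the final character.
        return {"length": len(entry), "last": entry[-1]}
    pos = depth
    while len({e[pos] for e in entries}) == 1:
        pos += 1  # all entries agree here, so this position is skipped
    groups = {}
    for e in entries:
        groups.setdefault(e[pos], []).append(e)
    return {
        "skip": pos - depth,
        "children": {ch: build_skip_tree(group, pos + 1)
                     for ch, group in groups.items()},
    }


def passes_skip_filter(candidate, node):
    """Inspect only branching characters, then the ending of the branch."""
    pos = 0
    while "children" in node:
        pos += node["skip"]
        if pos >= len(candidate):
            return False
        node = node["children"].get(candidate[pos])
        if node is None:
            return False
        pos += 1
    return len(candidate) == node["length"] and candidate[-1] == node["last"]


# Hypothetical dictionary for illustration.
tree = build_skip_tree(["bird.com", "bird.org", "lard.com"])
```

With this dictionary the vertex under "b" carries a skip value of 4, so only positions 0 and 5 and the final character are examined. A candidate such as "bxxx.org" therefore passes the filter, which is why a follow-on lookup stage is still required.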

The general approach is to introduce pre-filtering into the email address extraction phase in order to avoid unnecessary comparisons against the target email address database. In general the approach increases the overall performance of the extraction and lookup pipeline by reducing the number of items that are passed on to the lookup stage. As the lookup stage is the slowest point in the pipeline, reducing the number of times it is called upon effectively increases the overall throughput.

Claims

1. A method of validating and pre-filtering metadata items including compiling a set of characters of the allowed metadata and checking whether one or more characters of the metadata fall within the data set.
2. A method as claimed in claim 1 wherein only a portion of the metadata item is checked for characters.
3. A method as claimed in claim 1 or 2 where there is more than one metadata item and said character set comprises characters from both.
4. A method as claimed in claim 2 wherein both the characters and sequence of said metadata item, or a portion thereof, are checked against those of metadata items of interest.
5. A method as claimed in claim 1 where a hash function is also used on part or all of the metadata items.
6. A method as claimed in any previous claim where characters of the metadata are skipped.
7. A method as claimed in any previous claim where only the last character(s) of the metadata item is checked.
8. A method as claimed in any previous claim wherein said metadata item is an email.