GB2445763A - Metadata filtering - Google Patents

Publication number
GB2445763A
Authority
GB
United Kingdom
Prior art keywords
metadata
characters
filtering
lookup
email
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0700926A
Other versions
GB0700926D0 (en)
Inventor
Neil Duxbury
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Roke Manor Research Ltd
Original Assignee
Roke Manor Research Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Roke Manor Research Ltd
Priority to GB0700926A (GB2445763A)
Publication of GB0700926D0
Priority to DK08701850.3T (DK2122503T3)
Priority to PCT/GB2008/000172 (WO2008087429A1)
Priority to CA002675756A (CA2675756A1)
Priority to PCT/GB2008/000184 (WO2008087438A1)
Priority to EP08701850A (EP2122503B1)
Priority to CA002675820A (CA2675820A1)
Priority to EP08701862.8A (EP2122504B1)
Publication of GB2445763A
Priority to US12/505,147 (US20090276427A1)
Priority to US12/505,179 (US8380795B2)
Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8373 Query execution

Abstract

The invention comprises a method of validating and pre-filtering metadata items including compiling a set of characters present in the allowed metadata and checking whether one or more characters of the metadata to be filtered fall within the data set. In one embodiment, only a portion of the metadata item is checked for allowed characters. Where there are more than one metadata items the character set comprises characters from both.

Description

Metadata prefiltering
This invention relates to a method of processing metadata. Metadata is the term used for data which represents data and includes email and web addresses.
The methodology can also be applied to a large number of other metadata types including: URI/URL identification; SIP (Session Initiation Protocol) URI identification; E.164 telephone number detection; tag detection in other data formats; IP address, port range, protocol and session identifier detection; XML data structures and objects; HTML structures and objects; detection of content types; and identification of content from packet payloads.
Email lookup against a dictionary of target addresses is important in SPAM filtering for rejecting mail from known SPAM agents. The process of email address lookup against a target database can prove to be a significant performance bottleneck. The principal reason for this performance bottleneck is the processing overhead associated with checking whether all email addresses extracted from a data sample are in a database of target email addresses.
Various techniques are known for extracting email addresses. These use, for example, the character set defined by the standards for identifying an email address. However, for lookup only specific email addresses are required.
In reality the probability of obtaining a hit on the database with an arbitrary email address is < 1%. Consequently, 99% of the lookup effort is spent rejecting potential items. The main existing method for solving this problem is to extract all email addresses in the sample and look them up one by one. This has the drawback that around 99% of the addresses presented to the lookup phase are not in the dictionary.
Thus, the lookup algorithm spends most of its time rejecting potential matches. Such primitive methodology is shown in figure 1, which shows the methodology of having to search for just one domain name, "bird.com". Some of this effort can be saved by reducing the number of items that are presented to the database for comparison. This is achieved according to the invention by pre-filtering the extracted email addresses based on the contents of the target dictionary.
This application describes an email address pre-filtering method. Although the methodology is designed specifically for application to email addresses, the techniques employed can be used to identify many other structured forms of data, for example URLs.
The invention comprises a method of validating and pre-filtering metadata items including compiling a set of characters of the allowed metadata and checking whether one or more characters of the metadata fall within the data set. Only a portion of the metadata item may be checked for allowed characters.
Where there is more than one metadata item, said character set comprises characters from both.
Both the characters and sequence of said metadata item, or a portion thereof, may be checked against those of metadata items of interest.
Alternatively a hash function is also used on part or all of the metadata items.
Email address lookup according to the invention is normally a two stage process involving first pre-filtering, followed if necessary by comparison of the extracted email address against a database containing target email addresses. The invention decreases the number of comparisons that must be made against this database by performing pre-filtering as part of the email address extraction phase. This application focuses on the various ways in which this pre-filtering may be achieved.
In this scenario the absence of the full spectrum of valid characters in the user and domain name parts means that fewer email addresses are identified by the extraction algorithm. However, by supporting the addresses defined in the dictionary we also ensure that those addresses we are interested in are successfully identified so they can be passed on to the lookup stage. This change reduces the number of items that are passed on to the full lookup phase and thus speeds up the overall pipeline of extraction and lookup.
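The two stage pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the invention: the regular expression is a deliberately simplified extractor, and the names `extract_and_lookup`, `targets`, `sample` and `allow` are hypothetical. The pre-filter shown here checks only the domain part against the characters of a single assumed target domain.

```python
import re

# Simplified extraction pattern for illustration only; it does not cover
# every address form permitted by the email standards.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_and_lookup(sample, targets, prefilter=None):
    """Return (hits, number_of_lookups) for the addresses found in sample."""
    hits = []
    lookups = 0
    for candidate in EMAIL_RE.findall(sample):
        # Stage 1: cheap pre-filter; rejected candidates never reach the lookup.
        if prefilter is not None and not prefilter(candidate):
            continue
        # Stage 2: full comparison against the target dictionary.
        lookups += 1
        if candidate in targets:
            hits.append(candidate)
    return hits, lookups


# Hypothetical data: one target address and a sample with three addresses.
targets = {"alice@bird.com"}
sample = "mail from alice@bird.com, bob@example.org and carol@zebra.net"

# Assumed pre-filter: every character of the domain must occur in "bird.com".
domain_chars = set("bird.com")
allow = lambda addr: set(addr.split("@", 1)[1]) <= domain_chars

print(extract_and_lookup(sample, targets))
print(extract_and_lookup(sample, targets, allow))
```

Without the pre-filter all three extracted addresses reach the lookup stage; with it only the one address whose domain is built from the allowed characters does.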
Pre-filtering can be achieved in a number of ways including:
a) pre-filtering based on a restricted character set. This increases performance by utilising the reduced character set defined by the dictionary entries when compared to the set defined in the standards. The reduced set means that hitting an email address is less likely.
b) pre-filtering based on a restricted character set and using the structure of the items in the target dictionary. The addition of structure to the filter means that hitting an email address is less likely.
c) the use of a state machine for the structured filtering.
d) the use of hashing for the structured filtering.
e) the use of a tree structure for the structured filtering.
f) the combined lookup and extraction algorithm utilising a tree structure. This effectively does away with the lookup stage altogether and performs the extraction and lookup simultaneously.
g) pre-filtering based on trees with skip vertices to reduce the memory overhead of the approach. This reduces the memory requirement of implementing the tree based approach whilst providing similar if not better statistics for email hit rate as the character and structure based approaches.
Example 1
In this simple embodiment of the invention the pre-filtering comprises building up a character set and testing the characters of the target metadata against the character set. For example, if one had a list of web sites (domain names) and "hotmail.com" was of interest, then according to the method the character set would comprise {h, o, t, m, a, i, l, ., c, o, m}. Where the metadata are domain names, one has to take the dots (.) into consideration. Thus, when building a search algorithm, whenever a character is encountered which is not part of the set it can be disregarded (or regarded, depending on the situation).
If, for example, there are two domain names of interest (hotmail.com, google.com), the character set may include the characters from both.
In general the approach described here restricts the set of characters that can appear in an email address to those that appear in the target dictionary. Any detection algorithm based on this restricted set will then provide enhanced performance, as the probability of finding an email address composed of the restricted set is less than the probability of finding an email address composed of the full set. Thus, for an arbitrary sample fewer instances are passed on to the full lookup stage, which enhances the overall performance. This methodology is useful when only looking for a small number of domains, e.g. hotmail.com, where it gives an increase in throughput in the extraction phase of approximately 20%.
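A minimal sketch of the character-set approach described above, assuming the two domain names mentioned in the text form the whole target dictionary. The function names `build_charset` and `passes_charset` are hypothetical and chosen only for illustration.

```python
def build_charset(targets):
    """Collect every character that occurs in any target item."""
    chars = set()
    for item in targets:
        chars.update(item)
    return chars


def passes_charset(candidate, allowed):
    """A candidate passes only if all of its characters are in the allowed set."""
    return all(ch in allowed for ch in candidate)


# The two domain names of interest used in the text.
allowed = build_charset(["hotmail.com", "google.com"])

print(passes_charset("hotmail.com", allowed))  # a target, so it must pass
print(passes_charset("yahoo.com", allowed))    # "y" is not in the set
print(passes_charset("gmail.com", allowed))    # passes, although not a target
```

The last call shows why a lookup stage is still needed after this filter: a domain that is not in the dictionary can pass if it happens to use only allowed characters.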
Example 2
Figure 3 illustrates this methodology. In effect the resulting state machine says: find email addresses whose domain name is prefixed with one of the sequences in the set {bi, ba, li, la} and where the remaining suffix contains only the characters in the set {rd.com}, i.e. the end of the email address is compressed into a reduced number of states and the prefix is expanded to give enhanced filtering. This approach effectively limits the amount of branching that can occur and simplifies the implementation of the approach.
This is a more specific example than the Example 1 methodology, in that the partial structure (sequence) as well as the character set is tested.
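The structured filter described above might be sketched as a prefix test followed by a character-set test on the remainder. The prefix set and suffix characters are taken from the text; the function name `passes_structured` and the fixed prefix length of two are assumptions made for this illustration, and the figure itself may arrange the states differently.

```python
# Prefix sequences and suffix characters as given in the text.
PREFIXES = {"bi", "ba", "li", "la"}
SUFFIX_CHARS = set("rd.com")
PREFIX_LEN = 2  # assumed: every prefix in the set has two characters


def passes_structured(domain):
    """Check the expanded prefix first, then the compressed suffix states."""
    if domain[:PREFIX_LEN] not in PREFIXES:
        return False
    # The remaining characters are only tested for set membership,
    # not for their order.
    return all(ch in SUFFIX_CHARS for ch in domain[PREFIX_LEN:])


print(passes_structured("bird.com"))
print(passes_structured("ibrd.com"))
print(passes_structured("bind.com"))
```

Compared with a pure character-set filter, "ibrd.com" is now rejected because its first two characters are not an allowed prefix, even though every character is individually allowed.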
Example 2b
In addition the state machine (process) of the last example may be truncated, i.e. just checking for a prefix of "la" or "bi".
Example 3, hash based pre-filtering
This may use a hash function for the whole or part of the domain name(s) of interest.
The hash function may be any suitable one. For example each character may be assigned a numerical value and some operation performed. Hashing in certain circumstances is an effective and efficient method of testing metadata. Hashing may also be done for only a portion of the domain name. For example, hash over the first 4 or more characters of the email address.
Partial hashing may be done in conjunction with checking the remainder of the metadata for either valid characters or for the correct sequence. The corresponding figure shows a prefix hash with a tail-end state machine.
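A rough sketch of partial hashing combined with a character check on the remainder, as described above. The hash function, the table size and the names `prefix_hash`, `build_filter` and `passes_hash` are all assumptions for illustration; the text only requires that each character be assigned a numerical value and some operation performed.

```python
PREFIX_LEN = 4    # hash over the first 4 characters, as suggested in the text
TABLE_SIZE = 257  # assumed small modulus, chosen arbitrarily


def prefix_hash(text):
    """Simple polynomial hash over the first PREFIX_LEN characters."""
    value = 0
    for ch in text[:PREFIX_LEN]:
        value = (value * 31 + ord(ch)) % TABLE_SIZE
    return value


def build_filter(targets):
    """Return the set of prefix hashes and the set of characters seen in the tails."""
    hashes = {prefix_hash(t) for t in targets}
    tail_chars = set()
    for t in targets:
        tail_chars.update(t[PREFIX_LEN:])
    return hashes, tail_chars


def passes_hash(candidate, hashes, tail_chars):
    """Prefix must hash to a known value; the tail must use allowed characters."""
    if len(candidate) < PREFIX_LEN:
        return False
    if prefix_hash(candidate) not in hashes:
        return False
    return all(ch in tail_chars for ch in candidate[PREFIX_LEN:])


hashes, tail_chars = build_filter(["bird.com"])
print(passes_hash("bird.com", hashes, tail_chars))
print(passes_hash("bird.net", hashes, tail_chars))
```

Because different prefixes can collide on the same hash value, a passing candidate is still only a probable match and would normally be confirmed by the lookup stage.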
Example 4, tree based pre-filtering
The state machine based pre-filtering method described previously does not make full use of the underlying structure contained in the email addresses in the target dictionary. In particular, the mapping of several edges to a single vertex allows emails such as hin@lird.com to defeat the filter. This is because it contains "bi" and all characters are part of the allowed character set. This disadvantage can be addressed by adding additional vertices to the compressed or full form of the state machine. This is illustrated for the domain name state machine in figure 5. This approach completely obviates the need for the lookup stage, as the lookup and identification of an email address within the dictionary are done simultaneously. In this modification to the approach the lookup and extraction algorithms are merged so that the lookup is performed as an email address is extracted. Consequently, there is no longer any need for the lookup phase. Thus, this design should result in the highest performance.
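The tree based approach can be illustrated with a plain character trie, in which walking the tree is itself the dictionary lookup. This is a sketch under assumptions: the dictionary of two domain names, the nested-dict representation and the names `build_trie`, `in_trie` and `END` are hypothetical, and the state machine in the figure may differ in detail.

```python
# Marker key for "a dictionary entry ends here". It is longer than one
# character, so it can never collide with a character read from a candidate.
END = "<end>"


def build_trie(targets):
    """Build a nested-dict trie with one vertex per character."""
    root = {}
    for target in targets:
        node = root
        for ch in target:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def in_trie(root, candidate):
    """Follow the candidate character by character; fail on the first mismatch."""
    node = root
    for ch in candidate:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


# Assumed dictionary for illustration.
trie = build_trie(["bird.com", "lard.com"])
print(in_trie(trie, "bird.com"))
print(in_trie(trie, "lird.com"))
```

Unlike the compressed state machine, the trie keeps a separate vertex for each branch, so "lird.com" is rejected at its second character and no separate lookup is required.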
Example 4a
The method of 'path compression' can also be applied to the above approach. Although this would still require a follow-on lookup stage, the advantage of this method is that it would greatly reduce the amount of memory required to represent the set of dictionary email addresses. A path compressed version of the state machine shown in figure 5 is illustrated in figure 6. Within the path compressed variant of the data structure a skip value is added to each internal vertex. The skip value allows the algorithm to move forward and check the only possible ending for the appropriate branch. A mismatch at this position leads to rejection whereas a match passes the filter.
This modification saves memory by removing the internal nodes and minimises the number of characters that need to be looked at to determine if the email address is worth looking up. This modification greatly reduces the number of comparisons that need to be made in the pre-filter and also significantly improves the pre-filtering for large dictionaries. Use of this method is expected to reject a large number of the candidate email addresses before they are looked up.
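One possible reading of the path compressed variant is sketched below. Each internal vertex stores the position of the next character worth inspecting (standing in for the skip value), and each leaf stores only the length and final character of its entry as the "only possible ending". These representation choices, the example dictionary and the names `build` and `passes` are assumptions; the structure in figure 6 may encode skips and endings differently.

```python
def build(targets, start=0):
    """Build a path compressed tree over the distinct target strings."""
    targets = sorted(set(targets))
    if len(targets) == 1:
        only = targets[0]
        # Leaf: keep just enough to check the ending of the branch.
        return {"length": len(only), "last": only[-1:]}
    # Skip forward to the first position at which the targets differ.
    pos = start
    while len({t[pos:pos + 1] for t in targets}) == 1:
        pos += 1
    groups = {}
    for t in targets:
        groups.setdefault(t[pos:pos + 1], []).append(t)
    children = {ch: build(group, pos + 1) for ch, group in groups.items()}
    return {"pos": pos, "children": children}


def passes(node, candidate):
    """Inspect only the branching positions, then check the ending."""
    while "children" in node:
        pos = node["pos"]
        node = node["children"].get(candidate[pos:pos + 1])
        if node is None:
            return False
    return len(candidate) == node["length"] and candidate[-1:] == node["last"]


# Assumed dictionary for illustration.
root = build(["bird.com", "bird.org", "lard.com"])
print(passes(root, "bird.org"))
print(passes(root, "bird.net"))
print(passes(root, "bxxx.com"))  # skipped characters are never inspected
```

The last call passes the filter even though it is not in the dictionary, which is why this variant still needs the follow-on lookup stage mentioned in the text.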
The general approach is to introduce pre-filtering into the email address extraction phase in order to avoid unnecessary comparisons against the target email address database. In general the approach increases the overall performance of the extraction and lookup pipeline by reducing the number of items that are passed on to the lookup stage. As the lookup stage is the slowest point in the pipeline, reducing the number of times it is called upon effectively increases the overall throughput.
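The throughput argument above can be made concrete with a simple cost model. The model and all figures used here are hypothetical and are not taken from the application: each candidate is assumed to pay a fixed filter cost, and only the fraction that passes the filter pays the lookup cost.

```python
def expected_cost(candidates, lookup_cost, filter_cost=0.0, pass_rate=1.0):
    """Total cost when each candidate is filtered and only some are looked up."""
    return candidates * (filter_cost + pass_rate * lookup_cost)


# Hypothetical figures: 1000 candidates, lookup 8 units, filter 1 unit,
# and a filter that lets a quarter of the candidates through.
without_filter = expected_cost(1000, 8.0)
with_filter = expected_cost(1000, 8.0, filter_cost=1.0, pass_rate=0.25)

print(without_filter)
print(with_filter)
```

Under this model the pre-filter pays off whenever its cost per candidate is smaller than the lookup cost multiplied by the fraction of candidates it rejects.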

Claims (8)

  1. A method of validating and pre-filtering metadata items including compiling a set of characters of the allowed metadata and checking whether one or more characters of the metadata fall within the data set.
  2. A method as claimed in claim 1 wherein only a portion of the metadata item is checked for characters.
  3. A method as claimed in claim 1 or 2 where there is more than one metadata item and said character set comprises characters from both.
  4. A method as claimed in claim 2 wherein both the characters and sequence of said metadata item, or a portion thereof, are checked against those of metadata items of interest.
  5. A method as claimed in claim 1 where a hash function is also used on part or all of the metadata items.
  6. A method as claimed in any previous claim where characters of the metadata are skipped.
  7. A method as claimed in any previous claim where only the last character(s) of the metadata item is checked.
  8. A method as claimed in any previous claim wherein said metadata item is an email.
GB0700926A 2007-01-08 2007-01-18 Metadata filtering Withdrawn GB2445763A (en)

Priority Applications (10)

Application Number Priority Date Filing Date Title
GB0700926A GB2445763A (en) 2007-01-18 2007-01-18 Metadata filtering
EP08701862.8A EP2122504B1 (en) 2007-01-18 2008-01-18 A method of extracting sections of a data stream
PCT/GB2008/000184 WO2008087438A1 (en) 2007-01-18 2008-01-18 A method of extracting sections of a data stream
PCT/GB2008/000172 WO2008087429A1 (en) 2007-01-18 2008-01-18 A method of filtering sections of a data stream
CA002675756A CA2675756A1 (en) 2007-01-18 2008-01-18 A method of filtering sections of a data stream
DK08701850.3T DK2122503T3 (en) 2007-01-18 2008-01-18 PROCEDURE FOR FILTERING SECTIONS OF A DATA STREAM
EP08701850A EP2122503B1 (en) 2007-01-18 2008-01-18 A method of filtering sections of a data stream
CA002675820A CA2675820A1 (en) 2007-01-18 2008-01-18 A method of extracting sections of a data stream
US12/505,147 US20090276427A1 (en) 2007-01-08 2009-07-17 Method of Extracting Sections of a Data Stream
US12/505,179 US8380795B2 (en) 2007-01-18 2009-07-17 Method of filtering sections of a data stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0700926A GB2445763A (en) 2007-01-18 2007-01-18 Metadata filtering

Publications (2)

Publication Number Publication Date
GB0700926D0 GB0700926D0 (en) 2007-02-28
GB2445763A true GB2445763A (en) 2008-07-23

Family

ID=37846541

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0700926A Withdrawn GB2445763A (en) 2007-01-08 2007-01-18 Metadata filtering

Country Status (1)

Country Link
GB (1) GB2445763A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB1148856A (en) * 1967-01-26 1969-04-16 Ibm Data comparing systems
US5452451A (en) * 1989-06-15 1995-09-19 Hitachi, Ltd. System for plural-string search with a parallel collation of a first partition of each string followed by finite automata matching of second partitions
US20020188926A1 (en) * 2001-05-15 2002-12-12 Hearnden Stephen Owen Searching for sequences of character data

Also Published As

Publication number Publication date
GB0700926D0 (en) 2007-02-28

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)