CN110785979B - System, method and domain tokenization for domain spoofing detection

Info

Publication number: CN110785979B
Application number: CN201880041998.6A
Authority: CN
Inventors: 迈克尔·希夫曼
Original assignee: Farsight Security Inc
Current assignee: Farsight Security Inc
Priority date: 2017-05-17
Filing date: 2018-05-17
Publication date: 2021-02-05
Anticipated expiration: 2038-05-17
Also published as: CN110785979A; EP3635938A1; EP3635938A4; WO2018213574A1

Abstract

Systems and methods for detecting domain name spoofing in a Domain Name System (DNS) are described. A malicious party may register a domain name in the DNS that spoofs a domain name associated with a company in an attempt to entice the user to a malicious destination network address based on the user's trust in the company. This may result in a diminished online persona of the company, as the company's domain is associated with malicious activity. In an embodiment, a system is described that receives input from a subscriber including a domain name that the subscriber wishes to protect, ignore, or give special scrutiny. The system receives an instance of a domain name registered in the DNS and performs a method of determining whether the domain name is attempting to impersonate the subscriber's domain name. An alert is generated so that the subscriber can take corrective action.

Description

System, method and domain tokenization for domain spoofing detection

Technical Field

The field relates generally to Domain Name System (DNS) and domain name impersonation.

Background

Communication networks allow data to be transmitted between two different locations. To transmit data over a network, the data is typically divided into a plurality of segments, called packets or blocks. Each packet or block may have a destination network address, such as an Internet Protocol (IP) address, that indicates the destination of the packet and tells intermediate forwarding devices how to route it. These addresses are often numerical, difficult to remember, and may change frequently. Because of this difficulty, these addresses are frequently associated with "domain names," readable strings that are typically associated with the owner of one of the addresses. A domain name consists of substrings called "labels" that are separated by dots, such as "www.example.com.", where "www", "example" and "com" are all labels. When typed into a web application such as a web browser, the domain name is translated into the IP address that represents the destination network address. For example, the Google search engine is associated with the Fully Qualified Domain Name (FQDN) "www.google.com.", and this domain name, when typed into a web browser, may be converted to a numeric IP address, such as "192.168.1.0".

The DNS is a system that enables this translation. The DNS stores mappings between domain names and their respective IP addresses, tracks any changes in the mappings, where domain names can be remapped to different IP addresses and vice versa, and performs domain name to IP address translations. Thus, DNS is commonly referred to as the "phonebook" of the internet, in which domain names and their respective IP addresses are stored. The DNS converts domain names to IP addresses under the command of a web application, such as a web browser, so that a user of the web application can simply remember the domain names instead of the numeric IP addresses. DNS can divide a domain space into hierarchies, where different organizations control different portions of a hierarchy. In different parts of the hierarchy, different name servers may store resource records that map domain names to network addresses.

To look up a network address from a domain name, the DNS may use resolvers that perform query sequences on different name servers. For example, the query sequence for resolving www.example.com may begin with a root name server, which indicates the address of the name server for the generic top-level domain (gTLD) ".com". The DNS resolver may then query the name server of the ".com" domain to obtain the address of the name server of example.com, and may then query the name server of example.com to obtain the address of www.example.com. In practice, to eliminate the need for the resolver to traverse the entire sequence for each request, the resolver may cache the addresses of the various name servers.
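The query sequence and the effect of caching can be illustrated with a toy simulation. This is only a sketch: the in-memory `ZONES` table, the `resolve` function, and the address `192.0.2.10` are hypothetical stand-ins and involve no real name servers or network traffic.

```python
# Toy in-memory "name servers": each zone maps a name to either a
# referral (the next zone to ask) or a final address. All data is made up.
ZONES = {
    ".": {"com.": ("referral", "com.")},
    "com.": {"example.com.": ("referral", "example.com.")},
    "example.com.": {"www.example.com.": ("address", "192.0.2.10")},
}


def resolve(fqdn, cache):
    """Walk referrals from the root, reusing cached answers where possible.

    Returns (address, number_of_queries_sent).
    """
    labels = fqdn.rstrip(".").split(".")
    zone = "."
    queries = 0
    # Ask about progressively longer suffixes: com., example.com., ...
    for depth in range(1, len(labels) + 1):
        name = ".".join(labels[-depth:]) + "."
        if name in cache:
            kind, value = cache[name]
        else:
            kind, value = ZONES[zone][name]
            cache[name] = (kind, value)
            queries += 1
        if kind == "address":
            return value, queries
        zone = value
    raise LookupError(fqdn)


cache = {}
print(resolve("www.example.com.", cache))  # first lookup walks every zone
print(resolve("www.example.com.", cache))  # second lookup is served from cache
```

In this sketch the first lookup sends one query per level of the hierarchy, while the repeated lookup sends none.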

DNS suffers from significant security issues, due both to its age and to the ingenuity of illegitimate parties. In particular, the creation of new entries in the DNS is completely unsupervised. A party may register a domain name and its corresponding IP address through any of several domain name registrar services, which are essentially private enterprises that are certified to create new records in the DNS that map new domain names to IP addresses. Many new domain names are registered each day. Some domain names are registered for malicious purposes.

One of these malicious purposes may be broadly referred to as "domain name impersonation," in which an illegitimate party may register a new domain name in an attempt to fool an ordinary internet user into believing that the new domain name is associated with a well-known company or brand name. By impersonating a well-known entity, this rogue party can trick internet users into directing their traffic to the rogue party's own website or other server, which can perform illegal activities. When a user attempts to access the domain name, the DNS can translate the domain name into a network address (such as an IP address) that the user does not expect and that may exist for illegitimate purposes.

The illicit purpose may include: introducing malware into a user's computer system, or performing an internet-based fraud called "phishing". Phishing websites may provide the appearance of legitimate companies to entice users to reveal confidential personal information, such as passwords and credit card numbers. These illegal actions may diminish the brand value of a particular company because its brand name and online image are considered untrusted.

Domain name impersonation can take many forms, each designed to fool users with a different strategy. For example, an illegitimate party may register a new domain name that adds extraneous characters, such as dashes, to an otherwise well-known domain name. For example, the legitimate domain name "www.coca-cola.com" may be impersonated by another domain name having the same letters and extra dash characters (such as "www.co-ca-col-a.com"). A nefarious party may register a domain name that, when read, has a similar pronunciation to a brand name (e.g., "www.koka-kola.com"). In another case, a domain name may be registered that replaces a character with a different character having a similar appearance, such as replacing the letter "I" with the number "1". The problem is further compounded by the recent development of Internationalized Domain Names (IDNs), in which characters other than Latin letters can also be used in the domain name and can be resolved by the DNS. In all cases, these domain names may resolve to IP addresses that can perform illegal actions on users who access them.

Therefore, there is a need for systems and methods for detecting potential instances of domain name spoofing of company brands and domain names.

Disclosure of Invention

In an embodiment, a system for determining whether a domain name newly registered with the DNS impersonates a subscriber brand or a subscriber domain name is disclosed. The filter module first determines whether the domain name matches an entry in the list of allowed domain names. If the domain name does not match an entry in the list, the domain name is passed to a pre-processing module that removes extraneous characters to generate a processed domain name string. The processed domain name string is then passed to a token parser module that generates candidate strings based on the processed domain name string. The matching engine then receives the candidate strings and determines whether they match the subscriber brand or domain name by processing both strings using one of a number of matching algorithms. If the criteria are met based on the processing, the candidate string matches the subscriber brand and the matching engine generates an alert report to send to the subscriber.

In an embodiment, a method of generating a plurality of candidate tokens from a DNS name is disclosed. First, a fully qualified DNS name string is received and processed to generate a processed DNS name string. The processed DNS name string is then parsed to generate a plurality of labels, where each label is a substring of the processed DNS name string. Then, the total number of labels of the processed DNS name string is determined. Then, for each integer value between one and the total number of labels, a subset of labels whose size equals the integer value is obtained from the plurality of labels, and the labels in the subset are concatenated together to form a candidate token. The candidate token is added to the plurality of candidate tokens. After the plurality of candidate tokens is generated, each candidate token is analyzed to determine whether it matches the subscriber string.
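The token generation steps can be sketched as follows. The text does not say which subsets of labels are chosen for each size, so this sketch assumes contiguous runs of labels in their original order; the function name `candidate_tokens` is hypothetical.

```python
def candidate_tokens(processed_fqdn):
    """Concatenate every contiguous run of labels into a candidate token."""
    # Split on dots and drop the empty label left by a trailing dot.
    labels = [label for label in processed_fqdn.split(".") if label]
    total = len(labels)
    tokens = []
    for size in range(1, total + 1):            # 1 .. total number of labels
        for start in range(total - size + 1):   # each run of `size` labels
            token = "".join(labels[start:start + size])
            if token not in tokens:             # keep each token once
                tokens.append(token)
    return tokens


for token in candidate_tokens("www.fars1ghtsecyu.ritee.com."):
    print(token)
```

Under this assumption, a name with four labels yields ten candidates, including the concatenation "fars1ghtsecyuritee" that spans the dot inserted between two labels.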

Method, apparatus and computer program product embodiments are also disclosed. The drawing in which an element first appears is generally indicated by the leftmost digit(s) in the corresponding reference number. In the drawings, like reference numbers may indicate identical or functionally similar elements.

Drawings

The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present disclosure and, together with the description, further serve to explain the principles of the disclosure and to enable a person skilled in the pertinent art to make and use the disclosure.

Fig. 1 illustrates DNS name spoofing according to an embodiment.

Fig. 2A-2B illustrate a system for detecting DNS name spoofing, according to an embodiment.

Fig. 3 is a flow diagram illustrating a method for detecting DNS name spoofing, according to an embodiment.

Fig. 4A-4B are flow diagrams illustrating DNS name preprocessing and candidate token generation according to embodiments.

Fig. 5 illustrates an example of how a set of strings may be preprocessed.

Fig. 6 illustrates an example of how another set of strings may be preprocessed.

Fig. 7 illustrates a module for matching two strings, according to an embodiment.

Fig. 8A-8B illustrate algorithm categories for finding literal matches between strings, according to an embodiment.

Fig. 9A-9B illustrate algorithm categories for finding phonetic matches between strings, according to an embodiment.

Fig. 10A-10B illustrate matching algorithm categories for finding pictograph matches between strings, according to an embodiment.

Detailed Description

The system allows a company (also referred to as a subscriber) to initiate a service that detects any attempt by a rogue party to register an FQDN within the DNS that mimics the subscriber's well-known brand and domain name. In embodiments, the subscriber may enter relevant information about their brand, such as the brand itself, expressed as a string, a known domain name space that the subscriber may own and therefore not wish to monitor, and a domain space that the subscriber specifically wishes to monitor for potential brand counterfeiting. Upon detecting spoofing, the system generates and sends an alert report to the subscriber so that the subscriber is notified of the potential spoofing and can take any corrective steps.

The notification may occur in real time or near real time. An array of DNS sensor nodes transmits DNS records that contain FQDNs in real time. The DNS sensor nodes passively observe and record DNS queries and responses resolved by DNS servers and resolvers. These sensor nodes may be placed at strategic observation points in various networks around the world. They may be in the networks of Internet Service Providers (ISPs), internet exchanges, internet cloud service operators, well-known social networks, universities, and large corporations. The DNS data observed by these sensor nodes enables the creation of a feed of newly observed DNS records that delivers DNS records to various destinations only a few seconds after the DNS records are first observed. Because of this feed, the disclosed system can receive DNS records in real time and can process these records, on the order of seconds, to determine whether they represent a domain name spoofing attempt. In this manner, the system can detect instances of DNS spoofing in real time or near real time.

Not only can detection occur in real time, embodiments can also improve the breadth and quality of DNS spoofing detection. To improve the quality of the detection, the system employs several techniques. The system preprocesses each newly received FQDN and tokenizes it to obtain a large number of candidate tokens. Preprocessing and tokenization may remove the decoy characters introduced by the illicit party, so that the resulting candidate tokens more fully expose the substrings within the FQDN that may trick internet users into believing that the FQDN represents the subscriber brand. By more fully exposing these substrings, preprocessing and tokenization may improve the ability of the system to detect subscriber brand spoofing.

The system applies a number of algorithms to each candidate token to determine whether it meets the matching criteria. These algorithms compare the subscriber's brand with the various candidate strings generated by the preprocessor and token parser to determine whether the FQDN is attempting to impersonate the subscriber's brand. The algorithms fall into three groups: literal algorithms, phonetic algorithms, and pictographic algorithms. A literal algorithm looks for a substring match or a full string match between the FQDN and the subscriber's brand. A phonetic algorithm looks for candidate tokens that can be pronounced in a similar manner to the subscriber brand, thereby enticing internet users to believe that a fake FQDN represents the brand. A pictographic algorithm looks for strings that are visually close to the subscriber's brand.

A pictographic algorithm, particularly an Internationalized Domain Name (IDN) homograph algorithm, detects an IDN that attempts to impersonate a subscriber brand by employing non-Latin characters, such as Cyrillic or Greek characters, that have a similar appearance to Latin characters. In one embodiment, the IDN homograph algorithm generates substitute strings based on the subscriber brand by replacing some of the Latin characters within the brand with non-Latin characters, and compares these substitute strings against the candidate tokens generated from the received IDN.
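One way to picture the generation of substitute strings is the sketch below. The lookalike table is a tiny illustrative assumption rather than a complete confusables list, and the names `LOOKALIKES`, `substitute_strings`, and `matches_brand` are hypothetical.

```python
from itertools import product

# Tiny illustrative lookalike table (Latin -> similar non-Latin characters).
LOOKALIKES = {
    "a": ["\u0430"],            # Cyrillic small letter a
    "e": ["\u0435"],            # Cyrillic small letter ie
    "o": ["\u043e", "\u03bf"],  # Cyrillic small letter o, Greek small omicron
    "p": ["\u0440"],            # Cyrillic small letter er
}


def substitute_strings(brand):
    """Return every variant of `brand` with at least one lookalike swapped in."""
    # For each position: the original character plus any lookalikes.
    choices = [[ch] + LOOKALIKES.get(ch, []) for ch in brand]
    variants = {"".join(combo) for combo in product(*choices)}
    variants.discard(brand)  # the unmodified brand is not a substitute
    return variants


def matches_brand(candidate_token, brand):
    """True if the token is a lookalike rendering of the brand."""
    return candidate_token in substitute_strings(brand)


print(matches_brand("c\u0430t", "cat"))  # Cyrillic "a" in place of Latin "a"
print(matches_brand("cat", "cat"))       # the genuine string is not flagged
```

The number of variants grows multiplicatively with the number of substitutable characters, so a practical implementation would likely bound or index the generated set; that choice is outside this sketch.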

The detailed description is divided into three sections. First, an embodiment of a system for receiving, resolving, and detecting potential DNS name spoofing is described with respect to fig. 1 and fig. 2A-2B. Second, embodiments of a preprocessing system and method to generate multiple spoofed candidates from a single DNS name are described with respect to fig. 4-6. Finally, embodiments of a matching engine and matching algorithm are described with respect to fig. 7, 8A-8B, 9A-9B, and 10A-10B.

DNS counterfeit detection system

Fig. 1 illustrates the problem of DNS name spoofing. As described above, DNS name impersonation is an attempt by an illegitimate party to register a domain name in the DNS that can be mistaken for a domain name or brand of a well-known company, in order to deceive the general public for an illegitimate purpose. This can be detrimental to the company's brand, as it can lead to public distrust of the websites and online services provided by the company.

A company may wish to protect the FQDNs it has registered with the DNS through a domain name registrar. Domain names that are similar in appearance or pronunciation to those FQDNs may trick the public into believing that those domain names are also associated with the company, when in fact they exist only for illegitimate purposes, such as resolving to illegitimate websites that host malware or supporting email phishing fraud. Meanwhile, because of their similarity to service names, brand names, and slogans used by the company, the public may associate the company with some FQDNs that the company has not registered with the DNS. For example, even if the company has not registered a corresponding domain name in the DNS, people may associate the company with an abbreviation of the company's name or with a slogan that appears in the company's advertisements. A rogue party may register a domain name that is similar in appearance or pronunciation to either the registered FQDNs or the unregistered names and phrases associated with the company, in an attempt to fool unsuspecting people into believing that the company is associated with the domain name. In this case, a company may wish to detect when a domain name that may be mistaken for its own is registered in the DNS.

Fig. 1 illustrates an example of such a scenario. In fig. 1, a company named "Farsight Security Incorporated" may have a list of brands associated with the company and a list of internet-facing FQDNs representing the brands. Fig. 1 illustrates a brand list 101 and an FQDN list 102. Each member of both brand list 101 and FQDN list 102 is represented as an American Standard Code for Information Interchange (ASCII) string. For purposes of this illustration, the members of these lists are all referred to as entries. Entry 102A of FQDN list 102 may be an FQDN representing the company's primary website ("www.farsightsecurity.com" in this case). The FQDN list 102 may include entries 102B-D that include similar FQDNs with different gTLDs (such as ".com", ".net", and ".org") that the company has registered with the DNS. FQDN list 102 can also include entries 102E-F having abbreviations (such as "fsi") associated with the company. Finally, the list may contain similar entries 102G-J representing website names or email addresses that the company feels may be associated with it in the public eye.

List 104 contains FQDNs that may be mistaken for corresponding names in FQDN list 102. This list is intended only to illustrate the possibility of domain name impersonation and should not be considered limiting. List 104 illustrates various methods of domain name impersonation that may be successful in enticing people to believe that these domain names are associated with a company. Although the cases represented in the list 104 represent primarily a single technique that a fraudster may use, these techniques are often used in combination.

One simple technique is to replace the characters with different characters having similar appearances. For example, entry 104A replaces an alphabetic character with a numeric character having a similar appearance (e.g., replaces the letter "I" with the number "1") to impersonate entry 102A in FQDN list 102.

Another technique adds extraneous subdomains separated by dots, or adds extraneous characters such as dashes within subdomains, in a way that does not appear to detract from reading the string as the legitimate domain name. For example, entries 104B, 104C, and 104F add periods and dashes to domain names that otherwise contain nearly the same characters as entries 102B, 102C, and 102F, respectively, which the company wishes to protect. Entry 104C "www.fa-r-sig.ht.org" contains all the letters of entry 102C "www.farsight.org", but is a different domain name registered with the DNS and can therefore resolve to a different and potentially harmful IP address.

Yet another technique is to register domain names that are pronounced in a similar manner to legitimate domain names. Entries 104D, 104E, 104G, and 104I represent domain names that, when read, may be pronounced in a similar manner to entries 102D, 102E, 102G, and 102I, respectively. For example, entry 104D replaces the letter "f" with the letter combination "ph" to mimic the pronunciation of the legitimate name, but may be registered in the DNS to resolve to a different IP address.

Finally, entries 104H and 104J contain characters from foreign alphabets (such as Cyrillic) that may be mistaken for Latin characters having a similar appearance, such as replacing the Latin letter "a" with the Cyrillic letter "а". Some characters in Cyrillic and Greek are nearly identical in appearance to Latin characters, but are not treated as the same characters by the DNS. These domain names, too, may trick unwary people into accessing destinations with potentially illegal purposes. The Internationalized Domain Names in Applications (IDNA) system allows strings containing these characters to be resolved in the DNS via an encoding (called "Punycode") that represents Unicode characters in ASCII form.
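The ASCII encoding of such names can be observed with Python's built-in "idna" codec, which implements the older IDNA 2003 rules. The sketch below is only an illustration: the helper names `to_ascii` and `to_unicode` and the example domain are hypothetical.

```python
def to_ascii(fqdn):
    """Encode a Unicode domain name to its ASCII (Punycode) form."""
    return fqdn.encode("idna").decode("ascii")


def to_unicode(ascii_fqdn):
    """Decode an ASCII (Punycode) domain name back to Unicode."""
    return ascii_fqdn.encode("ascii").decode("idna")


spoof = "\u0430pple.example"        # first letter is Cyrillic, not Latin "a"
encoded = to_ascii(spoof)
print(encoded)                      # an "xn--" label followed by ".example"
print(to_unicode(encoded) == spoof)
print(to_ascii("apple.example"))    # an all-ASCII name is left unchanged
```

Although the spoofed and genuine names look alike when rendered, their encoded forms differ, which is what allows them to be registered and resolved as distinct names.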

It should be noted that not all of these domain names may be associated with illegitimate parties. Legitimate entities may register domain names that are similar to those associated with a company. Likewise, not all names detected by embodiments of the present disclosure may result in actionable alerts. The goal of the systems and methods described in this disclosure is to generate an alert to send to a company when the system detects attempted spoofing of the company's domain name and brand. Companies registered with the service may then determine what action, if any, should be taken to prevent impersonation of their brand and domain name.

Fig. 2A is a diagram of a system 200 for detecting domain name spoofing. In an embodiment, system 200 includes a domain name whitelist database 220, a watchlist database 225, a brand list database 240, a matching policy database 230, and a match detection system 210. The match detection system 210 includes several functional modules including a filter module 212, an FQDN preprocessor 214, a token parser 216, and a matching engine 218. The preprocessor 214 outputs a preprocessed candidate FQDN 215, the token parser 216 outputs a candidate token 217, and the matching engine 218 outputs an alert report 250.

In an embodiment, a subscriber (such as a company or organization) subscribes to a domain name spoofing detection service performed by the system 200. The subscriber enters a set of parameters for the service that it wishes to receive from the system 200, including the brand names to be protected, the domain space to be monitored, the domain names and wildcards to be ignored, a set of matching algorithms to be used to detect DNS name spoofing, and the sensitivity in detecting potential domain name spoofing. These parameters are represented in brand list 242, white list 222, watch list 227, and algorithm list 232, which are stored in brand list database 240, white list database 220, watch list database 225, and matching policy database 230, respectively. The databases feed these parameters into the various modules of the match detection system 210 to perform the specified service. In embodiments, several different configurations may be applied to these databases: they may all be contained on a single server device or cluster of server devices, they may be contained on separate server devices, or they may be contained on the same server device or cluster along with the entire match detection system 210. These embodiments are non-limiting and those skilled in the art will recognize that several configurations of these databases are possible.

In an exemplary embodiment, match detection system 210 uses inputs from whitelist 222, watch list 227, brand list 242, and algorithm list 232 to perform the steps required to detect potential domain name spoofing for a subscriber. In embodiments, the various modules of the match detection system 210 may be implemented on a server device or cluster of server devices, on individual server devices, or even in a commercial cloud data center, such as Amazon Web Services (AWS). Those skilled in the art will recognize that any of these embodiments (most of which are not enumerated herein) will be suitable for implementing the match detection system 210. The functionality of the different modules of the match detection system 210 are briefly described herein, and a more detailed discussion is provided below.

DNS resource records may be received from an array of DNS sensor nodes. This data provides a snapshot of the DNS configuration and content data as if it were being used on the internet in real time. The DNS resource records may be decomposed into multiple sets of resource records (RRsets). The RRset may include one or more DNS Resource Records (RRs). The RR may be a single DNS record and may include several fields. These fields include:

an owner name field, which may specify the FQDN for which the DNS query is generated, such as www.example.com;

a time-to-live (TTL) field, which may indicate the amount of time (in seconds) that an RR may be cached;

a class field, which may indicate a protocol family or protocol instance, such as "IN" for the Internet protocol;

a type field, which may indicate the type of DNS record. An example of an RR type is type "A" ("address"), which indicates that a DNS record maps an FQDN to an IPv4 address. As another example, an RR type of "AAAA" ("IPv6 address") indicates that the DNS record maps an FQDN to an IPv6 address; and

type-specific data, such as an IP address that maps to the FQDN of the query.

In an example, the DNS record may map the FQDN to an IP address. The RRset may be a set of all resource records of a given type for a given domain. For example, multiple RRs may map FQDNs (such as "www.example.com.") to multiple different IPv4 addresses. In this example, the RRset of "www.example.com." contains all of these IPv4 addresses.
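The relationship between individual RRs and an RRset can be sketched as a small data structure. The class name `ResourceRecord`, its field names, and the sample addresses are illustrative assumptions and do not reflect the wire format or the record layout used by the sensor feed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    owner: str    # FQDN the record belongs to
    ttl: int      # seconds the record may be cached
    rclass: str   # e.g. "IN"
    rtype: str    # e.g. "A", "AAAA"
    rdata: str    # type-specific data, e.g. an IP address


def group_rrsets(records):
    """Group records by (owner, class, type); each group is one RRset."""
    rrsets = defaultdict(list)
    for rr in records:
        # Owner names are compared case-insensitively in the DNS.
        rrsets[(rr.owner.lower(), rr.rclass, rr.rtype)].append(rr.rdata)
    return dict(rrsets)


records = [
    ResourceRecord("www.example.com.", 300, "IN", "A", "192.0.2.1"),
    ResourceRecord("www.example.com.", 300, "IN", "A", "192.0.2.2"),
    ResourceRecord("www.example.com.", 300, "IN", "AAAA", "2001:db8::1"),
]
for key, rdata in group_rrsets(records).items():
    print(key, rdata)
```

Here the two "A" records for the same name fall into a single RRset, while the "AAAA" record forms a separate one.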

As discussed above, the DNS sensor node array observes new DNS records, and the disclosed system uses these DNS records to perform domain name spoofing detection. DNS sensor nodes of the DNS sensor node array observe and record DNS queries resolved by DNS servers and resolvers. The DNS sensor nodes may then send the DNS records to various destinations for further processing. These sensor nodes are placed at strategic points of observation, such as in the networks of Internet Service Providers (ISPs), internet exchanges, internet cloud service operators, well-known social networks, universities, and large corporations. These sensor nodes create a feed of newly observed DNS records that can be delivered to various destinations only a few seconds after the DNS records are first observed. Because of this feed, the disclosed system can receive DNS records in real time and process these records, on the order of seconds, to determine whether they represent a domain name spoofing attempt.

The filter module 212 is the first module in the match detection system 210 to receive a new DNS record 209. In an embodiment, the filter module 212 may employ a dedicated communication port, such as a User Datagram Protocol (UDP) port, to listen for new DNS records that may be received from the above-described DNS sensor node array. The filter module 212 can receive the DNS record 209 and determine whether the DNS record is of interest. When detecting domain name spoofing, only certain specific types of DNS records are of interest, such as address records (A), IPv6 address records (AAAA), pointer records (PTR), canonical name records (CNAME), and mail exchange records (MX). All these records may be used for different types of services. These will be discussed in more detail further below. In another embodiment, only DNS records that have already been determined to be of interest are received at system 200.

The watchlist database 225 stores a subscriber's watch list 227, which specifies the domain names that the subscriber wishes the system 200 to consider when determining potentially infringing domain names. In an embodiment, a subscriber may wish to consider only domain names in a portion of the domain name space, such as a single gTLD (such as ".com"). In this case, the subscriber may designate ".com" as its entry, so that the system 200 considers only FQDNs in the ".com" top-level domain. Alternatively, the subscriber may not specify any particular name space to observe, and the entry in the watchlist database 225 for that particular subscriber defaults to "*.", indicating that any FQDN 213 should be considered. These strings are stored in the watch list 227, and the watch list is fed into the filter module 212 so that the filter module 212 can check the FQDN 213. As with white list 222, in embodiments, FQDN 213 must be either an exact match for one of the full FQDNs specified in watch list 227 or a match for a wildcard entry in watch list 227 in order to be passed on to the other elements of match detection system 210.

As discussed above, the whitelist database 220 stores a white list 222 that specifies domain names that the subscriber wishes to ignore. Some domain names newly registered in the DNS may be legitimate or non-malicious, for example because other companies have similar brand names or abbreviations, or because the subscriber's company name is similar to certain common phrases. Likewise, a subscriber will not want to receive alerts of potential domain name spoofing for new domain names that the subscriber itself is registering with the DNS. In embodiments, an entry entered by the subscriber may be in the form of an FQDN, a partial domain name, or even a wildcard string representing a domain name, brand, or phrase that the subscriber wants the system 200 to ignore. The entries are stored in the white list 222, which is fed into the filter module 212 of the match detection system 210. Filter module 212 may then check whether FQDN 213 from DNS record 209 matches any white list entry from the subscriber and disregard any record that produces a match.

In order to be ignored by match detection system 210, in embodiments, FQDN 213 must be an exact match for one of the complete FQDNs specified in whitelist 222 or a fully formatted match for a wildcard entry in whitelist 222. This is because the goal of the match detection system 210 is to detect FQDNs that closely but not exactly match domain names and brands associated with or owned by the subscriber, since these are the most likely attempts by some illegitimate parties to impersonate the domain name or brand of the subscriber.

Thus, filter module 212 receives white list 222 and watch list 227 from whitelist database 220 and watchlist database 225, respectively, and determines whether the FQDN 213 stored in new DNS record 209 matches an entry on either list. In an exemplary embodiment, filter module 212 first determines whether FQDN 213 matches an entry on watch list 227; if no match is detected, FQDN 213 is discarded. If a match is detected, the filter module 212 then determines whether the FQDN 213 matches an entry on the white list 222. If no match is detected, the FQDN 213 is passed to the FQDN preprocessor 214 for further investigation; if a match is detected, the FQDN 213 is discarded. In other embodiments, the order of the matching may be reversed.
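The filtering order described above can be sketched as follows. The wildcard syntax is an assumption: this sketch uses shell-style patterns via Python's `fnmatch` module, in which "*." matches any FQDN written with a trailing dot. The function names and sample list entries are hypothetical.

```python
from fnmatch import fnmatchcase

# Record types named in the text as being of interest.
RECORD_TYPES_OF_INTEREST = {"A", "AAAA", "PTR", "CNAME", "MX"}


def matches_any(fqdn, entries):
    """Exact match, or shell-style wildcard match, against a list of entries."""
    name = fqdn.lower()
    return any(fnmatchcase(name, entry.lower()) for entry in entries)


def passes_filter(rtype, fqdn, watchlist, whitelist):
    """Return True if the FQDN should go on to preprocessing."""
    if rtype not in RECORD_TYPES_OF_INTEREST:
        return False
    if not matches_any(fqdn, watchlist):  # outside the monitored name space
        return False
    if matches_any(fqdn, whitelist):      # explicitly ignored by the subscriber
        return False
    return True


watchlist = ["*."]  # assumed default: consider every FQDN
whitelist = ["www.farsightsecurity.com.", "*.farsightsecurity.com."]
print(passes_filter("A", "www.fars-1ghtsecyu.ritee.com.", watchlist, whitelist))
print(passes_filter("A", "mail.farsightsecurity.com.", watchlist, whitelist))
```

Checking the watch list before the white list mirrors the exemplary embodiment; because both checks must succeed for a name to pass, reversing them gives the same result in this sketch.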

FQDN preprocessor 214 preprocesses FQDN 213 of DNS record 209 to remove certain characters that are known to defeat naive pattern matching algorithms, and generates preprocessed candidate FQDN 215. The preprocessed candidate FQDN 215 is submitted to token parser 216, which generates candidate tokens 217. The token parser 216 is described in more detail below with respect to fig. 4-6. In an embodiment, FQDN preprocessor 214 removes dash characters, which are the decoy characters most commonly added to an FQDN to trick a pattern matcher. Thus, the FQDN 213 of "www.fars-1ghtsecyu.ritee.com." will be converted to the preprocessed candidate FQDN 215 of "www.fars1ghtsecyu.ritee.com.". The candidate tokens 217 and the preprocessed candidate FQDN 215 may then be passed on to the matching engine 218. The matching engine 218 loads and runs several matching engine modules that implement matching algorithms that compare the preprocessed candidate FQDN 215 and candidate tokens 217 to entries of the brand list 242. If a match between any candidate and a subscriber-specified brand is detected, an alert report 250 is generated and sent to the subscriber, specifying the FQDN 213 and other information relevant to the detection. The matching engine 218 and matching algorithms are described in more detail below with respect to fig. 7, 8A-8B, 9A-9B, and 10A-10B.
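A minimal sketch of the dash-removal step might look like the following. The function name `preprocess_fqdn` is hypothetical, and two behaviors are assumptions not stated in the text: lower-casing the name, and leaving Punycode labels (those starting with "xn--") untouched because their dashes belong to the encoding.

```python
def preprocess_fqdn(fqdn):
    """Lower-case the name and strip dash characters from every label."""
    cleaned = []
    for label in fqdn.lower().split("."):
        if label.startswith("xn--"):
            # Assumption: dashes in a Punycode label are not decoys.
            cleaned.append(label)
        else:
            cleaned.append(label.replace("-", ""))
    return ".".join(cleaned)


print(preprocess_fqdn("www.fars-1ghtsecyu.ritee.com."))
print(preprocess_fqdn("www.co-ca-col-a.com"))
```

Because the dots are preserved, the label boundaries remain available to a later tokenization step.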

The brand list database 240 stores a brand list 242 containing subscriber brands, expressed as ASCII strings, that the subscriber wishes the system 200 to investigate for domain impersonation. In an embodiment, the subscriber may specify brands that are stored in the brand list database 240 and fed into the matching engine 218. After the preprocessed candidate FQDN 215 and any candidate tokens 217 are obtained, they are passed to the matching engine 218 for comparison with entries in the brand list 242. If a match is detected, an alert report 250 is generated.

Matching policy database 230 stores an algorithm list 232 that specifies the subscriber's choice of which matching algorithms matching engine 218 should use in determining a match. Matching engine 218 is capable of running several different types of string matching algorithms that attempt to match subscriber-specified entries in brand list 242 (stored in brand list database 240) with the preprocessed candidate FQDN 215 and candidate tokens 217. Upon initiating the service, the subscriber may specify a subset of the available matching algorithms to be executed by the matching engine 218, or may simply select all algorithms so that the matching engine 218 runs the entire set of available matching algorithms to compare entries in the brand list 242 with the preprocessed candidate FQDN 215 and candidate tokens 217.

Fig. 2B illustrates an example of subscriber preferences that may be stored in a database of system 200. Fig. 2B illustrates a new DNS record 209 with FQDN 213, a preprocessed candidate FQDN 215, a watchlist 227, a whitelist 222, candidate tokens 217, a brand list 242, and an algorithm list 232.

Referring again to the example of a subscriber referred to as "Farsight Security", the entries may be as follows. The watchlist 227 may contain only "*." to represent that no restrictions should be imposed on the FQDNs to be checked by the match detection system 210. An FQDN 213 (such as "www.fars-1ghtsecyu.ritee.com.") that does not match any entry in the whitelist 222 and does match the "*." entry in the watchlist 227 will be passed by the filter module 212 to the next stage of the system 200. As described above, whitelist 222 specifies a list of domain names and wildcards that system 200 should ignore, and may contain several domain names owned by Farsight Security or by other legitimate entities having the word "Farsight" in their name. Thus, the whitelist 222 may contain the FQDN of the subscriber's primary website, "www.farsightsecurity.com.", the wildcard entries "*.farsight.com" and "*.farsightsecurity.com", as well as other entries representing domain names known to the company. When a new DNS record 209 with an FQDN 213 is received by the match detection system 210, the filter module 212 determines whether the DNS record 209 is of interest to the matching engine, extracts the FQDN 213, and compares it to the whitelist 222. If the FQDN matches one of the entries in the whitelist (or the pattern indicated by a wildcard entry in the whitelist), then record 209 is disregarded.

Brand list 242 includes brand names that the subscriber may wish to protect. After the FQDN preprocessor 214 generates the preprocessed candidate FQDN 215, the token parser 216 receives the preprocessed candidate FQDN 215 and generates the candidate tokens 217. Both the preprocessed candidate FQDN 215 and the candidate tokens 217 may then be compared to entries in the brand list 242 by the matching engine 218 depicted in fig. 2A. Brand list 242 should contain ASCII strings such as "fsi" and "farsightsecurity".

Finally, algorithm list 232 contains the matching algorithms that matching engine 218 uses when comparing the preprocessed candidate FQDN 215 and candidate tokens 217 to entries in brand list 242. The subscriber is provided with a choice of which matching algorithms they wish the matching engine to use. In the algorithm list 232 shown in fig. 2B, only four algorithms are specified, even though more algorithms may be available. In some embodiments, some algorithms may also require algorithm-specific parameters, which may also be stored in the algorithm list 232. The matching engine 218 and the matching algorithms available in the system 200 will be discussed in more detail below.

Fig. 3 is a flow chart illustrating a method 300 performed by the system 200 of fig. 2A for detecting domain name spoofing. Where appropriate, the steps of the flow chart will be described with respect to the elements of system 200. The method 300 is described with respect to a single subscriber of the services provided by the system 200.

The method 300 begins with the arrival of a new DNS record in step 302. The record may be received from a large array of DNS sensor nodes that detect new entries in the DNS. In an embodiment, the domain name record may be a full DNS record with a type field specifying its function. In step 304, the DNS record received in step 302 is submitted to the filter module 212 of fig. 2A, where it is first checked against the subscriber's watchlist (such as watchlist 227 as illustrated in fig. 2A and 2B). In an embodiment, watchlist 227 is obtained from a watchlist database (such as watchlist database 225 illustrated in fig. 2A). If the candidate FQDN does not match any entry in the watchlist, the method ends because the FQDN is not part of the domain name space of interest to the subscriber. As discussed above with respect to fig. 2A, the watchlist may default to a single wildcard entry, "*.", indicating that any FQDN may be of interest to the match detection system 210. In this case, step 304 will always produce a match, and control continues to step 306.

In step 306, the DNS record is examined to determine if it is a record of interest to the match detection system 210 of FIG. 2A. This check may occur in the filter module 212, or in some embodiments, before reaching the match detection system 210, in which case step 306 may not need to be performed. As described above, the DNS record types of interest to system 200 include record types "A", "AAAA", "CNAME", "PTR", and "MX". In other embodiments, additional record types may be added to the records of interest. If the DNS record type is one of the aforementioned types, the method 300 continues with step 308. If not, the method ends.

In step 308, the FQDN is extracted from the DNS record. In an embodiment, this may occur at the filter module 212 after determining in step 306 that the system 200 is interested in the record. The format of the FQDN is a simple string, e.g., "www.fars-1ghtsecyu.ritee.com.". The FQDN may be an embodiment of FQDN 213 illustrated in fig. 2A.

In step 310, the FQDN obtained in step 308 is checked against the subscriber's whitelist (such as whitelist 222 as illustrated in fig. 2A and 2B). In an embodiment, the whitelist 222 is obtained from whitelist database 220. If the FQDN matches any entry in the whitelist (either as an exact match to an entry or as a pattern match to a wildcard entry), then the method ends because the FQDN belongs to a portion of the namespace that the subscriber has instructed the match detection system 210 to ignore.

After determining that the FQDN is on the subscriber's watchlist in step 304 and not on the subscriber's whitelist in step 310, the FQDN may be preprocessed in step 312 to remove extraneous characters. As discussed above, the FQDN extracted from the new DNS record may contain extraneous characters (especially dashes) that are intended to fool a simple pattern matcher. In step 312, those characters are removed to produce a "preprocessed candidate FQDN," such as preprocessed candidate FQDN 215 of fig. 2A.

In step 313, the preprocessed candidate FQDN obtained in step 312 may be submitted to the matching engine 218 and compared to the subscriber's brand list. The brand list of step 313 may be an embodiment of brand list 242 illustrated in fig. 2A and may be stored in a brand list database, such as brand list database 240 illustrated in fig. 2A. As discussed above, the brand list contains the brands specified by the subscriber, represented as ASCII strings. Specifically, the preprocessed candidate FQDN is first compared to the subscriber's brand list using literal matching algorithms, which seek string literal and substring matches between the preprocessed candidate FQDN and the brands specified by the subscriber. These algorithms are discussed in more detail below. In some embodiments, the original, unprocessed FQDN may also be compared to entries in the brand list using these literal matching algorithms. If the comparison between an entry in the brand list and the preprocessed candidate FQDN (or original FQDN) yields a match based on the criteria of the literal matching algorithm used, a match is detected in step 314.

If a match is detected in step 314, the process moves to step 330, where an alert report is generated and sent to the subscriber. In an embodiment, the alert report may contain information about the DNS record, including the FQDN, the time the record was detected, which brand list entry is being spoofed, which algorithm detected the spoofing, and other contextual data that may be relevant to the subscriber. The subscriber may then take corrective action as deemed appropriate. After the alert report is generated and sent, the process ends.
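As an illustration of the information such an alert report carries, the fields listed above could be grouped in a simple record type. The class and field names below are hypothetical and chosen only for this sketch; the text does not prescribe a report format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AlertReport:
    """Hypothetical container for the alert contents described in the text."""
    fqdn: str            # FQDN from the offending DNS record
    matched_brand: str   # brand list entry believed to be spoofed
    algorithm: str       # name of the algorithm that detected the match
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                    # time the record was detected
    context: dict = field(default_factory=dict)  # other contextual data


report = AlertReport(
    fqdn="www.fars-1ghtsecyu.ritee.com.",
    matched_brand="farsightsecurity",
    algorithm="literal",
)
print(report)
```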

If a match is not detected in step 314, the process moves to step 315, where the match detection system 210 determines whether it is configured to run any pictographic or phonetic matching algorithms. If it is not, the process ends; if it is, the preprocessed candidate FQDN string generated in step 312 is submitted to a token parser in step 316 to generate candidate tokens. The candidate tokens generated in step 316 are an embodiment of the candidate tokens 217 illustrated in fig. 2A-2B. In embodiments, this may occur within a separate software module (such as the token parser 216 illustrated in fig. 2A). The module may be implemented in the same device as the filter module 212 or in a device separate from the filter module 212. The preprocessing steps, including preprocessing step 312 and token parser step 316, are described in more detail below with respect to fig. 4A-4B, 5, and 6.

In step 320, the candidate tokens generated in step 316 and the candidate FQDN generated in step 312 are passed to a matching engine (such as matching engine 218 as illustrated in fig. 2A) to determine whether any of the candidate tokens represents an attempt to impersonate a brand associated with the subscriber. The matching engine 218 applies algorithms that compare each candidate token to the subscriber's brand list. The applied algorithms are specified in an algorithm list (such as algorithm list 232 of fig. 2A) and stored in a matching policy database (such as matching policy database 230 of fig. 2A). In this step, a different set of algorithms may be applied than in step 313, including phonetic matching algorithms and pictographic matching algorithms. These algorithms are described in more detail below with respect to fig. 7, 8A-8B, 9A-9B, and 10A-10B.

During step 320, the candidate token is compared to the string from the brand list using each algorithm in the list of algorithms. The algorithm determines whether the candidate token is impersonating (in whole or in part) a brand list entry based on some criteria specific to the algorithm used. If the criteria are met, then in step 325 a match is detected and the method moves to step 330 where an alert report is sent to the subscriber to inform the subscriber that a fake FQDN has been registered with the DNS. The alert report will contain information about the DNS record including the FQDN, the time the record was detected, which brand list entry is being spoofed, which algorithm determines the spoofing, and other contextual data that may be relevant to the subscriber. The subscriber may then take corrective action as deemed appropriate. If no match is detected after cycling through all the different candidate tokens, brand list entries, and algorithms, the method ends without sending any report.

The skilled artisan will recognize that there are many different ways of looping over brand list entries, candidate tokens, and algorithms, some of which may yield speed improvements or other performance benefits. Useful matching engines and matching algorithms are discussed in more detail below with respect to fig. 7, 8A-8B, 9A-9B, and 10A-10B.

In an embodiment, each candidate token may be compared to each entry in the brand list of the subscriber using each algorithm specified by the subscriber in the algorithm list, and when a match is detected, an alert report is generated and sent to the subscriber. The order of how each list loops may vary. For example, in an embodiment, each brand list entry may be compared to a candidate token using each algorithm for a single candidate token being investigated, after which the next brand list entry is compared to the candidate token. This continues until all brand list entries are compared to the candidate token, after which the next candidate token is selected and the process is repeated for all brand list entries and algorithms. If one of the comparisons results in a match being detected (where the current candidate token and brand list entry are determined to be a match based on criteria of one of the algorithms), then an alert report containing all necessary information is sent to the subscriber and the process stops without cycling through any other candidate tokens. In embodiments, the alert report may contain the original FQDN obtained in step 308, the candidate token and brand list entries that have been determined to match, the algorithm used, the type of original DNS record received in step 302, the time of DNS record received by the system, etc.

In another embodiment, the brand list may form the outermost loop: for a given brand list entry, each candidate token is compared to that entry using each algorithm before the next candidate token is compared. After all candidate tokens have been compared, the next brand list entry is selected and the entire process is repeated until a match is found or all comparisons between each brand list entry and each candidate token have been completed. If a match is detected, an alert report is generated and sent to the subscriber and the process stops.
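The loop orderings described above can be sketched as nested loops that stop at the first match. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, each algorithm is modeled as a callable taking a brand and a candidate, and a plain case-insensitive substring test stands in for a real matching algorithm. It iterates candidate tokens in the outer loop; swapping the two outer `for` statements gives the brand-list-outermost variant.

```python
from typing import Callable, Optional

# Assumed interface: an algorithm takes (brand, candidate) and reports a match.
Matcher = Callable[[str, str], bool]


def substring_match(brand: str, candidate: str) -> bool:
    """Stand-in algorithm: case-insensitive substring test."""
    return brand.lower() in candidate.lower()


def first_match(
    tokens: list[str],
    brands: list[str],
    algorithms: dict[str, Matcher],
) -> Optional[tuple[str, str, str]]:
    """Return (token, brand, algorithm name) for the first match, or None."""
    for token in tokens:                      # one candidate token at a time
        for brand in brands:                  # every brand list entry
            for name, matcher in algorithms.items():  # every selected algorithm
                if matcher(brand, token):
                    return token, brand, name  # stop: an alert would be sent here
    return None                                # no report is generated


result = first_match(
    ["www", "farsightcom"],
    ["fsi", "farsight"],
    {"substring": substring_match},
)
print(result)  # ('farsightcom', 'farsight', 'substring')
```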

B. Domain name preprocessing and tokenization

In this section, the preprocessing and tokenization processes are described. The purpose of preprocessing the FQDN is to obtain strings embedded within the FQDN that may represent an attempt to illicitly mimic the domain name or brand of the subscriber. Preprocessing and tokenization may generate strings that are more easily recognized by string matching algorithms. For example, a subscriber referred to as "Farsight Security" may have a registered FQDN of "www.farsightsecurity.com.". A malicious party attempting to impersonate the domain name may register a DNS record with the FQDN "www.ww.far.s1-ght.sec.yu-rit.ee.com.". This second FQDN clearly embeds a string that an incautious internet user could read as "farsightsecurity", but because of the extraneous periods (indicating new subdomains) and dash characters, a simple string matching algorithm may not be able to determine that this newly registered FQDN is an impersonation attempt by a malicious party. Thus, an unsuspecting user may be tricked into accessing the IP address associated with the FQDN. The preprocessing and tokenization process aims to assist the matching algorithms by parsing the suspect FQDN and presenting a set of strings (referred to as candidate tokens) consisting of parts of the suspect FQDN, which are used to determine a match between the suspect FQDN and the brands and domain names of the subscriber.

Fig. 4A-4B illustrate different embodiments for preprocessing and tokenizing FQDNs to generate candidate tokens. Fig. 4A is a flow diagram illustrating a method 400 for preprocessing an FQDN to determine candidate tokens. In an embodiment, method 400 may be performed by an FQDN preprocessor and token parser (such as FQDN preprocessor 214 and token parser 216 presented in fig. 2A). Method 400 may be an embodiment of steps 312 and 316 depicted in fig. 3.

In step 402, extraneous characters are removed from the original FQDN. The original FQDN may be an embodiment of FQDN 213 of fig. 2A. Step 402 may be an embodiment of step 312 in fig. 3. In an embodiment, the removed character is the dash character. As discussed above, dash characters are typically added to an FQDN to defeat the simple string matching algorithms used to catch obvious instances of domain name spoofing. This step generates a preprocessed candidate FQDN, such as preprocessed candidate FQDN 215 in fig. 2A. In the above example, the originally received FQDN "www.ww.far.s1-ght.sec.yu-rit.ee.com." may be converted into the preprocessed candidate FQDN "www.ww.far.s1ght.sec.yurit.ee.com.".
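A minimal sketch of this character-removal step follows. The function name and the `extraneous` parameter are hypothetical; the sketch assumes, as in the described embodiment, that the dash is the only character removed.

```python
def preprocess_fqdn(fqdn: str, extraneous: str = "-") -> str:
    """Remove every character listed in `extraneous` (by default, dashes)."""
    # str.maketrans with a third argument maps those characters to None,
    # so translate() deletes them and leaves everything else untouched.
    return fqdn.translate(str.maketrans("", "", extraneous))


print(preprocess_fqdn("www.ww.far.s1-ght.sec.yu-rit.ee.com."))
# www.ww.far.s1ght.sec.yurit.ee.com.
```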

In step 404, the token parser begins by obtaining the preprocessed candidate FQDN generated in step 402 and extracting each DNS label. In an embodiment, the separator character "." marks the boundary between different labels in the FQDN. For example, extracting the labels from the preprocessed candidate FQDN "www.ww.far.s1ght.sec.yurit.ee.com." yields 8 labels: "www", "ww", "far", "s1ght", "sec", "yurit", "ee", and "com". These labels may be combined in different left-to-right adjacent combinations, eventually resulting in candidate tokens that may match the subscriber's brands stored in the brand list. Each label by itself is also added as a candidate token.

In step 406, the process of creating candidate tokens begins. A counter is initialized to 1. The counter reflects the number of labels generated in step 404 that are to be combined to form a single candidate token. In step 408, candidate tokens are generated by concatenating a number of labels equal to the counter into one string. The labels forming a single candidate token must appear consecutively, from left to right, in the preprocessed candidate FQDN generated in step 402.

As an example, the candidate FQDN "www.ww.far.s1ght.sec.yurit.ee.com." has 8 labels: "www", "ww", "far", "s1ght", "sec", "yurit", "ee", and "com". For a counter of 6, the candidate tokens are "wwwwwfars1ghtsecyurit", "wwfars1ghtsecyuritee", and "fars1ghtsecyuriteecom", where each candidate token consists of 6 of the labels listed above concatenated into a string. The first candidate token, "wwwwwfars1ghtsecyurit", concatenates the first 6 labels listed above: "www", "ww", "far", "s1ght", "sec", and "yurit". The candidate token "fars1ghtsecyuriteecom" combines the last 6 labels listed above: "far", "s1ght", "sec", "yurit", "ee", and "com".

The order in which the labels are concatenated to generate the candidate tokens must be the same as the order in which they appear in the candidate FQDN string when read from left to right, and they must appear adjacent to one another in the candidate FQDN string, separated only by a separator character. Thus, a string containing two labels placed together that do not occur consecutively in the candidate FQDN string will not be a valid candidate token. Referring again to the candidate FQDN "www.ww.far.s1ght.sec.yurit.ee.com.", a string such as "wwwwwfarsecyurit" will not be a valid candidate token because it places the non-consecutive labels "far" and "sec" adjacent to each other. The string "wwwwws1ghtfar" will also not be a valid candidate token, since it places the labels "far" and "s1ght" in an incorrect order when compared to the preprocessed candidate FQDN.

Thus, in step 408, for a given counter value, all valid candidate tokens are generated and added to the running list of candidate tokens. For a counter of 1, all labels themselves are considered candidate tokens. When the counter equals the total number of labels generated from the preprocessed candidate FQDN, a single candidate token is generated that is a concatenation of every label in the order in which it appears in the preprocessed candidate FQDN. In step 410, the counter is checked to determine whether it is equal to the total number of labels. If it is not, steps 406 through 410 are repeated with an incremented counter until all candidate tokens have been generated. The resulting candidate tokens may be an embodiment of candidate tokens 217 illustrated in fig. 2A.
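The counter-driven generation of method 400 can be sketched as follows. The function names are hypothetical, and the sketch assumes the input has already had its extraneous characters removed and that "." is the separator character.

```python
def extract_labels(preprocessed_fqdn: str) -> list[str]:
    """Step 404 (sketch): split on '.', dropping the empty string after the trailing dot."""
    return [label for label in preprocessed_fqdn.split(".") if label]


def candidate_tokens_by_length(preprocessed_fqdn: str) -> list[str]:
    """Steps 406-410 (sketch): for each counter value, join every run of that many adjacent labels."""
    labels = extract_labels(preprocessed_fqdn)
    tokens = []
    for counter in range(1, len(labels) + 1):         # number of labels per token
        for start in range(len(labels) - counter + 1):  # leftmost label of the run
            tokens.append("".join(labels[start:start + counter]))
    return tokens


tokens = candidate_tokens_by_length("www.ww.far.s1ght.sec.yurit.ee.com.")
print(len(tokens))                              # 36
print("fars1ghtsecyuriteecom" in tokens)        # True: last six labels, in order
print("wwwwwfarsecyurit" in tokens)             # False: skips the label "s1ght"
```

Because every token is a slice of the label list, the adjacency and ordering requirements hold by construction.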

Fig. 4B illustrates another flow diagram of a method 450 for preprocessing an FQDN to determine candidate token strings. In an embodiment, method 450 may be performed by an FQDN preprocessor and token parser, such as FQDN preprocessor 214 and token parser 216 presented in fig. 2A. Method 450 may be an embodiment of steps 312 and 316 depicted in fig. 3.

The method 450 of fig. 4B begins in step 452 by removing extraneous characters from the received FQDN to obtain a preprocessed candidate FQDN, and continues in step 454 by obtaining candidate tokens from the labels of the preprocessed candidate FQDN. These steps are the same as steps 402 and 404 of method 400. Steps 456 to 460 differ from steps 406 to 410 of method 400, but result in the same set of candidate tokens. Steps 406 to 410 generate candidate tokens by creating every possible candidate token consisting of a given number of labels from the preprocessed candidate FQDN. Steps 456 to 460, on the other hand, generate all candidate tokens starting with a given label from the preprocessed candidate FQDN before moving to the next label. For example, for the preprocessed candidate FQDN "www.far.sight.com.", in step 454 the first candidate tokens are generated from the labels of the preprocessed candidate FQDN, in this case "www", "far", "sight", and "com". Step 456 then generates each candidate token starting with the label "www": "wwwfarsightcom", "wwwfarsight", and "wwwfar". Then, in step 458, the method proceeds to the next label that appears in the preprocessed candidate FQDN, in this case the label "far". Step 460 loops back to step 456 because the method has not yet traversed all labels. Step 456 then generates each candidate token starting with "far": "farsightcom" and "farsight". Steps 456 to 460 repeat once more, generating the candidate token "sightcom".
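The start-label-driven variant of method 450 can be sketched as follows, using a hypothetical function name and assuming "." as the separator. For each starting label it emits the longest run first, which reproduces the order of the example above.

```python
def candidate_tokens_by_start(preprocessed_fqdn: str) -> list[str]:
    """Sketch of steps 454-460: single labels first, then all runs grouped by starting label."""
    labels = [label for label in preprocessed_fqdn.split(".") if label]
    tokens = list(labels)                      # step 454: each label is itself a token
    for start in range(len(labels)):           # steps 458/460: advance to the next label
        # step 456: every run of two or more adjacent labels beginning at `start`,
        # longest first
        for end in range(len(labels), start + 1, -1):
            tokens.append("".join(labels[start:end]))
    return tokens


print(candidate_tokens_by_start("www.far.sight.com."))
# ['www', 'far', 'sight', 'com', 'wwwfarsightcom', 'wwwfarsight', 'wwwfar',
#  'farsightcom', 'farsight', 'sightcom']
```

The result contains the same ten strings that a length-driven enumeration produces for four labels; only the order of generation differs.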

Thus, fig. 4A illustrates a method 400 and fig. 4B illustrates another method 450, both of which will generate the same set of candidate tokens from the same received FQDN. The skilled person will recognise that a large number of embodiments are possible for generating the candidate token.

Fig. 5 illustrates the character strings generated in each step of the preprocessing method 400. In the schematic, the original FQDN 501 "aa- -aa- -a.b-bbb-b-b.cc-cc-c-dd-d" is received as part of the new DNS record. This original FQDN 501 may be determined during step 308 depicted in fig. 3. The FQDN may contain various extraneous characters. After the preprocessing step 402, a preprocessed candidate FQDN 502 is generated. In an embodiment, the preprocessed candidate FQDN 502 may be fed into a matching engine for comparison with the brands and domain names of the subscriber, as shown in step 313 of fig. 3.

In step 404, labels 504 are created by parsing the candidate FQDN based on the separator character (in this case, a dot). In this case, four labels 504A-D are created. Although these four labels have been simplified in this example to "aaaaa", "bbbbbb", "cccc", and "ddd", in general these strings may represent real subdomains in the DNS.

During steps 406 through 410, candidate tokens 508 are generated. The candidate tokens may be an embodiment of candidate tokens 217 illustrated in fig. 2A. At level 506A, the counter equals 1, and candidate tokens consisting of a single label are generated, resulting in four candidate tokens 508A-D that are identical to the four labels 504A-D. At level 506B, the counter equals 2, and candidate tokens 508E-G are generated, each the result of concatenating two consecutive labels from the preprocessed candidate FQDN 502. Thus, candidate tokens 508E-G are the concatenations of labels 504A and 504B, 504B and 504C, and 504C and 504D, respectively. It should be noted that for the preprocessed candidate FQDN 502, these are the only valid candidates generated when the counter is equal to 2, since there are no other consecutive combinations of two labels in the candidate FQDN 502.

At level 506C, the counter equals 3, and two candidate tokens 508H-I are generated, which are concatenations of labels 504A-504C and 504B-504D, respectively. Finally, at level 506D, the counter equals 4, and only one candidate token 508J is generated, which is a concatenation of all labels 504A-D. The candidate tokens 508A-J are all of the valid candidate tokens that may be generated from the original FQDN 501. Thus, the candidate tokens 508 and the two FQDNs 501 and 502 are the strings that a matching engine (such as matching engine 218 of fig. 2A) compares to the brands and domain names of the subscriber to determine whether the FQDN can be considered a domain name spoofing attempt by a malicious party.

Fig. 6 illustrates another example of the strings that may be generated by the preprocessing module performing method 400, this time for an FQDN that produces five labels instead of four. The strings and their reference numerals are analogous to those of fig. 5: original FQDN 601, candidate FQDN 602, labels 604, and candidate tokens 608 generated for different values of the counter (depicted as levels 606A-E). Since the original FQDN 601 now produces five labels separated by the "." character, the number of candidate tokens 608 is greater than the number of candidate tokens 508 of fig. 5.

As mentioned above, several embodiments generate the same sets of candidate tokens as candidate tokens 508 and 608 of fig. 5 and 6. For example, although it does not use a counter as described with respect to step 406 of method 400, method 450 depicted in fig. 4B will generate the same sets of candidate tokens 508 and 608. Those skilled in the art will recognize that there may be many alternative ways to generate candidate tokens.

Fig. 6 illustrates two important observations. First, fig. 6 illustrates why candidate tokens must consist of consecutive adjacent labels from a preprocessed candidate FQDN (such as preprocessed candidate FQDN 602), concatenated in the order in which they appear. Since the purpose of domain name spoofing is to trick an unwary user into associating a spoofed domain name with a subscriber, the original FQDN 601 that a malicious party registers in the DNS will attempt to mimic the brand and domain name associated with the subscriber, since this is most likely to fool an unwary user. Thus, candidate tokens such as 608H ("fars1ght") and 608L ("fars1ghtcom") reflect strings close to the subscriber's brand name "Farsight". On the other hand, a string consisting of labels of the preprocessed candidate FQDN 602 concatenated out of order, such as "s1ghtfar", is less likely to fool a user, or may reflect the brand name of a different legitimate entity. Thus, the requirement that candidate tokens consist of consecutive, adjacent, and properly ordered labels serves as a basic filter against excessive sensitivity of the domain name spoofing detection system.

Methods 400 and 450 of fig. 4A and 4B, respectively, each create a set of candidate tokens that satisfies this continuity requirement.

Second, the number of candidate tokens is a function of the number of labels generated by parsing the candidate FQDN. Specifically, the number of candidate tokens is the "triangular number" based on the number of labels from the preprocessed candidate FQDN:

$$T_n = \frac{n(n+1)}{2}$$

where n = number of labels from the preprocessed candidate FQDN. For example, the four labels of fig. 5 yield 10 candidate tokens (508A-J), and the five labels of fig. 6 yield 15.
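The triangular-number relationship, n(n+1)/2 candidate tokens for n labels, can be checked with a short sketch. The function names are hypothetical; the count is obtained by enumerating every run of adjacent labels rather than by applying the formula, so the two can be compared.

```python
def triangular_number(n: int) -> int:
    """n-th triangular number: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def count_candidate_tokens(preprocessed_fqdn: str) -> int:
    """Count every run of one or more adjacent labels (one candidate token each)."""
    labels = [label for label in preprocessed_fqdn.split(".") if label]
    count = 0
    for start in range(len(labels)):
        for end in range(start + 1, len(labels) + 1):
            count += 1  # labels[start:end] is one valid candidate token
    return count


# Four labels, as in the example of fig. 5.
print(count_candidate_tokens("aaaaa.bbbbbb.cccc.ddd."), triangular_number(4))  # 10 10
```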

C. Matching engine and algorithm

Fig. 7 depicts a matching engine 700. The matching engine 700 may be an exemplary embodiment of the matching engine 218 appearing in fig. 2A, and may perform step 320 of the method 300 illustrated in fig. 3. The matching engine compares a candidate string (such as a candidate token 217 or one of FQDNs 213 and 215 depicted in fig. 2A) to a brand of a subscriber that may appear on a brand list of the subscriber (such as brand list 242 of fig. 2A).

In an exemplary embodiment, the matching engine may contain software instructions for executing three categories of matching algorithms (literal algorithms 710, phonetic algorithms 720, and pictographic algorithms 730). Each of these categories of algorithms follows a different principle for matching the candidate token (or candidate FQDN) with the brands and domain names of the subscriber, such as the entries of brand list 242. In general, the literal algorithm category 710 attempts to find a literal string match between the candidate token (or FQDN) and the subscriber brand, the phonetic algorithm category 720 determines whether the pronunciation of the candidate token (or FQDN) is similar to that of the subscriber brand, and the pictographic algorithm category 730 determines whether the candidate token (or FQDN) has a textual appearance similar to the subscriber brand or domain name.

The literal algorithm class 710 includes software for performing two specific algorithms: Boyer-Moore matching and so-called "Leet Speak" matching. The phonetic algorithm class 720 includes software for performing three specific algorithms: Double Metaphone matching, Metaphone 3 matching, and American Soundex matching. Finally, the pictographic algorithm class 730 includes software for performing two specific algorithms: Levenshtein distance matching and Internationalized Domain Name (IDN) homoglyph matching. Each of these algorithms is described in more detail below with respect to fig. 8A-8B, 9A-9B, and 10A-10B.

Fig. 8A-8B illustrate algorithms of the literal algorithm class 710 depicted in fig. 7. As discussed above, the literal algorithms represented by fig. 8A-8B seek to determine whether a string literal match exists between the search string and the target string. In other words, the algorithms determine whether the search string exists as an exact or near-exact substring in the target string. As used in the domain name impersonation system 200 of fig. 2A, the objective is to determine whether a substring within the original FQDN, the preprocessed FQDN, or a candidate token is an exact match with the subscriber brand (exceptions are explained below), in which case the owner of the original FQDN may be attempting to impersonate the subscriber brand.

Fig. 8A illustrates the Boyer-Moore matching algorithm. Fig. 8A shows a subscriber brand 802, an FQDN 804A of a new DNS record, a preprocessed FQDN 804B (where all dashes are removed), and candidate tokens 806A-E based on the preprocessed FQDN 804B. The candidate tokens 806A-E may be generated by the token parser 216 of fig. 2A, which performs the method 400 of fig. 4A or the method 450 of fig. 4B. Candidate tokens 806A-E do not represent the entirety of the candidate tokens generated based on FQDN 804B.

The Boyer-Moore algorithm is an optimized string matching algorithm from the literal algorithm class 710 that performs pure substring matching: the algorithm produces a match if the search string is identical to a substring within the target string. In an exemplary embodiment, the matching engine 700 executes the algorithm with the subscriber brand 802 as the search string and any of the FQDNs 804A-B or candidate tokens 806A-E as the target string. In an exemplary embodiment, only FQDNs 804A and 804B are compared to brand 802 using the Boyer-Moore algorithm. In other embodiments, the candidate tokens 806A-E may also be compared to the brand using the Boyer-Moore algorithm. It can be seen that candidates 806A and 806C both produce a match under the Boyer-Moore algorithm, since the subscriber brand 802 ("farsight") appears in both. In an embodiment, Boyer-Moore matching is also case-insensitive, meaning that each character of the subscriber brand 802 need not be matched by letter case (upper or lower), but only by the letter that the character represents (e.g., the target string "fArsIghtcom" still produces a match for the search string "farsight"). It should be noted, however, that FQDNs 804A-B do not produce a match when compared to the subscriber brand 802 under the Boyer-Moore algorithm, because of the extraneous periods they contain.
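The case-insensitive substring search described above can be sketched in Python as follows. For brevity the sketch uses the Boyer-Moore-Horspool simplification (bad-character shifts only) rather than the full Boyer-Moore algorithm; the function name and the sample strings are assumptions for illustration.

```python
def horspool_find(pattern, text):
    """Return the index of the first case-insensitive match, or -1.

    Simplified Boyer-Moore-Horspool search: only the bad-character shift
    table is used, not the good-suffix rule of full Boyer-Moore.
    """
    p, t = pattern.lower(), text.lower()
    m, n = len(p), len(t)
    if m == 0:
        return 0
    # Shift for each pattern character except the last one; later
    # occurrences overwrite earlier ones, giving the smallest safe shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(p[:-1])}
    pos = 0
    while pos <= n - m:
        if t[pos:pos + m] == p:
            return pos
        # Skip ahead based on the text character aligned with the
        # last position of the pattern.
        pos += shift.get(t[pos + m - 1], m)
    return -1


if __name__ == "__main__":
    print(horspool_find("farsight", "wwwfArsIghtcom") >= 0)
    print(horspool_find("farsight", "www.far.sight.com") >= 0)
```

The second call illustrates why a name whose periods fall inside the brand does not match until the labels are concatenated into candidate tokens.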

FIG. 8B illustrates another algorithm, known as the "Leet Speak" matching algorithm. Leet Speak is another algorithm from the literal algorithm class 710, similar to Boyer-Moore of FIG. 8A, that produces a match if the search string matches a substring within the target string. In this case, however, the search string may be represented as a regular expression in which a particular character of the search string may be matched by any of several other characters that are known to be used as substitutes for it.

FIG. 8B illustrates a subscriber brand 822, an original FQDN 824A of the new DNS record, a preprocessed FQDN 824B (where all dashes are removed), and candidate tokens 826A-E based on the preprocessed FQDN 824B. In an exemplary embodiment, only FQDNs 804A and 804B are compared to brand 802 using the Leet Speak algorithm. In other embodiments, the Leet Speak algorithm may also be used to compare candidate tokens 806A-E to brands. In addition to these strings, a modified search string 823 is displayed. Modified search string 823 is a string formatted as a regular expression that allows a particular character to be replaced with other characters that appear similar. For example, in certain contexts (especially internet chat rooms, online video games, etc.), the letter "a" is often replaced with a "4" character, and an illegitimate party may also take advantage of this when registering an FQDN to impersonate a company's brand. Similarly, the letter "i" may be replaced by the letter "l" or the numeral "1". The Leet Speak algorithm allows the search string to encompass all combinations of those commonly used character substitutions.

The modified search string 823 represents the ability to search for characters or their alternatives by bracketing all the characters that may stand in for one another. This representation is commonly referred to as a "character class". For example, "[a4]" indicates that the "a" character can be replaced with a "4" character, "[i1l]" indicates that the "i" character can be replaced with a "1" or an "l" character, "[s5]" indicates that the "s" character can be replaced with a "5" character, and so on. Thus, the modified search string 823 represents the subscriber brand 822 with any of these characters (or none of them) interchanged with their common replacements. As can be seen in FIG. 8B, the candidate 826A contains "fars19htsecuri7y" as a substring and generates a match, since the modified search string 823 allows the "g" character to be replaced with the "9" character and the "i" character to be replaced with the "1" character. Similarly, candidate 826C also contains the string "fars19ht", and a match is also generated. Two things should be noted about the Leet Speak algorithm: (i) ordinary occurrences of the string "farsight" in a candidate will also produce matches, and (ii) the extraneous periods in both FQDNs 824A and 824B will still prevent those FQDNs from generating matches in the Leet Speak algorithm.
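A modified search string of this kind could be built along the following lines. This is a minimal Python sketch; the substitution table SUBSTITUTIONS is a small illustrative assumption, not the full set of substitutions an embodiment would use, and the function name is hypothetical.

```python
import re

# Illustrative substitution table (an assumption): each letter maps to
# the characters commonly used in its place.
SUBSTITUTIONS = {"a": "4", "i": "1l", "s": "5", "g": "9", "t": "7"}


def leet_pattern(brand):
    """Compile a regular expression with one character class per
    substitutable letter of the brand."""
    parts = []
    for ch in brand.lower():
        alternatives = SUBSTITUTIONS.get(ch)
        if alternatives:
            parts.append("[" + re.escape(ch + alternatives) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    pattern = leet_pattern("farsight")
    print(pattern.pattern)
    print(bool(pattern.search("fars19htsecuri7y")))
    print(bool(pattern.search("far.sight.com")))
```

Because each character class also contains the original letter, the unmodified brand still matches, consistent with note (i) above.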

As discussed above, in embodiments, literal algorithms (including the Boyer-Moore algorithm and the Leet Speak algorithm) may be applied to the original FQDN and the preprocessed FQDN to determine whether a match is generated.

Fig. 9A-9B illustrate the algorithms of the phonetic algorithm class 720 depicted in fig. 7. As discussed above, the phonetic algorithm categories represented by FIGS. 9A-9B attempt to determine whether two words have similar pronunciations. As used in the domain name impersonation system 200 of fig. 2A, the goal is to determine whether a portion of a domain name has a similar pronunciation to a subscriber brand, in which case the domain name may attempt to impersonate the subscriber brand.

FIG. 9A illustrates the "American Soundex" algorithm. The American Soundex algorithm processes a string to generate a code representing the pronunciation of the string. In the context of the matching engine 218 of FIG. 2A, the algorithm is used to process both the subscriber brand and the candidate token, and their respective codes are then compared to see whether they are the same. If so, the algorithm generates a match and the matching engine can generate an alert report. FIG. 9A depicts a subscriber brand 902, an original FQDN 904A and a preprocessed FQDN 904B (with the dash removed), and a candidate token 906. After processing by the American Soundex algorithm, the subscriber brand 902 and the candidate token 906 have respective codes 903 and 907. Since the codes match (both codes are "F622"), the matching engine will generate a match and may issue an alert report due to the match.

The American Soundex algorithm generates a four-character code based on the consonants of a character string. The first character of the code is the first letter of the string processed by the algorithm. The remaining three characters of the code are determined by the presence of particular characters in the string, read from left to right. For example, the letters "b", "f", "v", and "p" map to a value of 1, while the letter "r" maps to a value of 6. Thus, for each occurrence of "b", "f", "v", or "p", a "1" character is added to the four-character code. Likewise, for each occurrence of the "r" character within the string, a "6" character is added to the four-character code. There are several exceptions to these rules based on the repetition of characters within a string. Once four characters are reached, the process stops regardless of how many characters remain.
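The rules above can be sketched in Python as follows. The sketch follows the commonly published American Soundex rules (including the exceptions for adjacent letters with the same value and for "h" and "w"); the function name is an assumption, and the code is an illustration rather than the implementation used by the matching engine.

```python
# Standard American Soundex letter groups.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character American Soundex code of word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        # Adjacent letters with the same value are coded only once.
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        # "h" and "w" do not separate letters with the same value;
        # vowels do, so they reset the previous value.
        if ch not in "hw":
            prev = digit
    # Pad short codes with zeros to four characters.
    return code.ljust(4, "0")


if __name__ == "__main__":
    print(soundex("farsight"))  # F622
```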

FIG. 9B illustrates both the Double Metaphone and Metaphone 3 algorithms. The original algorithm (referred to as the "Metaphone algorithm" for short) is similar to the American Soundex algorithm in that codes are generated based on the pronunciation of words. The Metaphone algorithm is significantly more complex than the American Soundex algorithm. Both the Double Metaphone and Metaphone 3 algorithms are enhancements of the original Metaphone algorithm. In particular, the Metaphone 3 algorithm is an optimized version of the Double Metaphone algorithm, which in turn is an enhanced version of the Metaphone algorithm. Both the Double Metaphone and the Metaphone 3 algorithms generate two codes from a string, rather than the single code generated by the American Soundex and Metaphone algorithms. In the context of the matching engine 218 of FIG. 2A, the algorithm is used to process both the subscriber brand and the candidate token, and their respective code pairs are then compared to determine whether the candidate token is impersonating the subscriber brand.

FIG. 9B depicts a subscriber brand 922, an original FQDN 924A and a preprocessed FQDN 924B (with the dash removed), and a candidate token 926. The subscriber brand is processed by either the Double Metaphone algorithm or the Metaphone 3 algorithm to generate a pair of codes 923A and 923B, and in a similar manner candidate token 926 is processed to generate a pair of codes 927A and 927B. In an embodiment, each code of the subscriber brand must match the corresponding code of the candidate token to generate a match. In another embodiment, only one code of the subscriber brand must match its corresponding candidate token code to generate a match. In another embodiment, one code of the subscriber brand need only match any code of the candidate token to generate a match.
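The three comparison policies described above can be sketched in Python as follows. The sketch assumes the two code pairs have already been produced by a Double Metaphone or Metaphone 3 implementation; the policy names and the placeholder codes in the usage example are hypothetical and are not real Metaphone output.

```python
def codes_match(brand_codes, token_codes, policy):
    """Compare two (primary, secondary) code pairs under one policy.

    Policy names are illustrative assumptions:
      "all_corresponding": both codes must equal their counterparts
      "any_corresponding": at least one code equals its counterpart
      "any_to_any":        any brand code equals any token code
    """
    pairs = list(zip(brand_codes, token_codes))
    if policy == "all_corresponding":
        return all(b == t for b, t in pairs)
    if policy == "any_corresponding":
        return any(b == t for b, t in pairs)
    if policy == "any_to_any":
        return any(b == t for b in brand_codes for t in token_codes)
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    # Placeholder codes, not real Metaphone output.
    brand = ("CODE1", "CODE2")
    token = ("CODE2", "CODE3")
    for policy in ("all_corresponding", "any_corresponding", "any_to_any"):
        print(policy, codes_match(brand, token, policy))
```

The policies run from strictest to most permissive, so a subscriber who prefers fewer false alerts would presumably favor the first.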

FIGS. 10A-10B illustrate algorithms of the pictograph algorithm class 730 depicted in FIG. 7. As discussed above, the algorithms of the pictograph algorithm class represented by FIGS. 10A-10B attempt to determine whether the target string visually resembles the search string. As used in the domain name impersonation system 200 of FIG. 2A, the purpose is to determine whether the original FQDN, the preprocessed FQDN, or a substring within a candidate token has a similar appearance to the subscriber brand, in which case the owner of the original FQDN may be attempting to impersonate the subscriber brand.

FIG. 10A illustrates the "Levenshtein distance" algorithm. The Levenshtein distance algorithm accepts two strings and determines a "distance" score that reflects the degree of dissimilarity of the two strings. The distance score is an integer greater than or equal to zero and reflects the number of edits required to make the two input strings identical to each other. For this reason, the distance score is sometimes referred to as an "edit distance". For example, the distance score of the character strings "farsight" and "faarsight" is 1, and the distance score of the character strings "farsight" and "forsight" is also 1, since each pair requires exactly one edit to make the strings the same (in the former case, an "a" character in the second string is deleted, and in the latter case, the "o" character in the second string is replaced with an "a" character). In the context of the matching engine 218 of FIG. 2A, the algorithm is used to generate a distance score based on the subscriber brand and the candidate token, and the score is then compared to a threshold value specified by the subscriber. If the distance is less than or equal to the threshold, then the algorithm generates a match.

FIG. 10A depicts a subscriber brand 1002, FQDN 1004A, preprocessed FQDN 1004B, and candidate token 1006. The subscriber brand 1002 and the candidate token 1006 are fed into a Levenshtein distance calculator, at which point a distance score 1010 is generated. If the Levenshtein distance is less than or equal to the threshold, a match is generated indicating that the FQDN is attempting to impersonate the subscriber brand or domain name.
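The distance calculation and threshold comparison can be sketched in Python as follows. This is the standard dynamic-programming formulation of edit distance; the function names and the threshold value in the usage example are assumptions for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b, counting
    single-character insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_match(brand, token, threshold):
    """A match is generated when the distance does not exceed the
    subscriber-specified threshold."""
    return levenshtein(brand, token) <= threshold


if __name__ == "__main__":
    print(levenshtein("farsight", "forsight"))
    print(is_match("farsight", "faarsight", 1))
```

A small threshold keeps the comparison strict; larger values catch more heavily altered names at the cost of more coincidental matches.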

FIG. 10B illustrates the Internationalized Domain Name (IDN) homograph algorithm. Internationalized Domain Names (IDNs) are domain names that may use characters from non-Latin alphabets such as Cyrillic, Greek, and Armenian. Since the number of such characters is many times greater than the number of Latin characters, these characters are encoded in Unicode rather than ASCII. Although IDNs have the appearance of domain names (formatted using domain labels separated by periods), because they contain non-Latin characters encoded in the Unicode standard, the entire IDN string must also be encoded using the Unicode standard. Although DNS is only designed for use with ASCII-encoded strings, application infrastructure has been developed and widely deployed that converts Unicode domain names to ASCII domain names. This allows an IDN to be translated by the DNS into a network address, as with FQDNs, in a manner that is substantially transparent to ordinary internet users.

In the context of domain name impersonation, characters from non-latin letters are generally similar to latin characters and may be used by illegitimate parties to create domain names that have a similar appearance to a subscriber brand. These domain names are encoded as Unicode strings that include non-Latin alphabetic characters. For example, the cyrillic letter "aka" may be used in place of the latin letter "a", the greek character "τ" may be used in place of the latin letter "t", and so on. While applications available to the user (such as Web browsers or email applications) present the IDN as a regular string, such as "www.f radar securi τ y.com," at DNS the IDN will convert to an ASCII Punycode string, which results in a different IP address conversion. Thus, the IDN homomorphism word algorithm detects attempts to fake subscriber brands using IDNs with non-latin characters.

The algorithm is depicted in FIG. 10B. In particular, FIG. 10B depicts subscriber brand 1022, domain name 1024A, preprocessed domain name 1024B (with dashes removed), and candidate token 1026. Note that domain name 1024A contains the Cyrillic letter "а", so domain names 1024A and 1024B are actually IDNs. The preprocessing and tokenization of the original IDN may occur in the same manner as the preprocessing and tokenization of a typical FQDN encoded in ASCII, with any operations tailored to be performed on Unicode-encoded strings rather than ASCII strings. The original IDN and preprocessed IDN 1024A-B are Unicode strings, as is candidate token 1026. Matching engine 218 will also generate multiple Unicode strings 1023A-D for subscriber brand 1022, where each Unicode string 1023A-D replaces one or more characters within the ASCII-encoded subscriber brand 1022 with non-Latin characters. It should be noted that in this embodiment the number of Unicode strings is non-limiting; there may be a large number of Unicode strings based on subscriber brand 1022.

Unicode strings 1023A-D will be generated based on a predetermined large-scale database that maps Latin characters to non-Latin characters. The database can be populated and periodically updated using Optical Character Recognition (OCR) to obtain a mapping between non-Latin characters and Latin characters based on how similar the non-Latin characters and Latin characters appear.
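Generating such Unicode strings from a subscriber brand can be sketched in Python as follows. The two-entry HOMOGLYPHS table is a tiny illustrative stand-in for the large-scale mapping database, and the function name is an assumption.

```python
from itertools import product

# Tiny illustrative mapping (an assumption): Latin letter -> look-alikes.
HOMOGLYPHS = {
    "a": ["\u0430"],  # Cyrillic small letter a
    "t": ["\u03c4"],  # Greek small letter tau
}


def unicode_variants(brand):
    """Return every variant of brand in which one or more characters
    are replaced by a look-alike non-Latin character."""
    options = [[ch] + HOMOGLYPHS.get(ch, []) for ch in brand]
    variants = {"".join(combo) for combo in product(*options)}
    variants.discard(brand)  # the unmodified brand is not a variant
    return variants


if __name__ == "__main__":
    variants = unicode_variants("farsight")
    token = "f\u0430rsightsecurity"
    print(len(variants))
    print(any(v in token for v in variants))
```

The number of variants grows multiplicatively with the number of substitutable characters, which is why the set of Unicode strings for a brand may be large.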

Using Unicode strings 1023A-D and candidate tokens 1024, a large number of comparison types can be performed. In one embodiment, the ASCII Punycode conversions of both Unicode strings 1023A-D and candidate tokens 1026 may be compared using the various algorithms described above to determine whether a match exists. The ASCII Punycode conversion converts a Unicode string into an ASCII-encoded string in which several ASCII characters are used to represent a single non-Latin Unicode character.
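The ASCII Punycode conversion can be illustrated with Python's built-in "idna" codec, which implements the older IDNA 2003 rules. The helper name to_punycode and the sample label are assumptions; a production system might use a different IDNA library.

```python
def to_punycode(label):
    """Convert a Unicode label to its ASCII ("xn--") form using
    Python's built-in IDNA 2003 codec."""
    return label.encode("idna").decode("ascii")


if __name__ == "__main__":
    spoofed = "f\u0430rsight"  # contains Cyrillic small letter a
    ascii_form = to_punycode(spoofed)
    print(ascii_form)
    # Decoding the ASCII form recovers the original Unicode label.
    print(ascii_form.encode("ascii").decode("idna") == spoofed)
    # A purely ASCII label is returned unchanged.
    print(to_punycode("farsight"))
```

The converted label looks nothing like the brand, so literal comparison on the ASCII forms is only meaningful when both sides (the generated Unicode strings and the candidate tokens) are converted in the same way.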

In other embodiments, matching algorithms that are the same as or similar to those discussed above may be applied here, adapted to operate on Unicode strings. In one embodiment, direct substring matching may be performed in a manner similar to Boyer-Moore, using a Unicode version of Boyer-Moore. In another embodiment, the Levenshtein distance algorithm may be applied in much the same way as for ASCII strings, where a match occurs and an alert report is generated if the distance score between one of Unicode strings 1023A-D and candidate token 1026 is less than or equal to a certain value. For example, in FIG. 10B, under any of these algorithms, Unicode string 1023D will match candidate token 1024.

In yet another embodiment, rather than generating a set of Unicode strings 1023A-E from subscriber brand 1022, a set of ASCII strings may be generated from candidate tokens 1026. Those ASCII strings may then be matched with the subscriber brand 1022 based on other algorithms depicted in the matching engine 700, including algorithms of the literal algorithm class 710, the phonetic algorithm class 720, or the Levenshtein distance algorithm (pictograph algorithm class 730).
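This reverse direction can be sketched in Python as follows. The TO_LATIN table is a small illustrative assumption, and the sketch produces a single ASCII string per candidate token rather than a set, which is a simplification.

```python
# Illustrative reverse mapping (an assumption): look-alike -> Latin.
TO_LATIN = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u03c4": "t",  # Greek small letter tau
}


def to_ascii_skeleton(token):
    """Replace each known look-alike character with its Latin twin."""
    return "".join(TO_LATIN.get(ch, ch) for ch in token)


def impersonates(brand, token):
    """Literal substring check on the Latin-mapped candidate token."""
    return brand.lower() in to_ascii_skeleton(token).lower()


if __name__ == "__main__":
    token = "f\u0430rsightsecuri\u03c4y"
    print(to_ascii_skeleton(token))
    print(impersonates("farsight", token))
```

Once the token has been mapped to ASCII, any of the literal, phonetic, or edit-distance comparisons can be applied to it unchanged.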

D. Conclusion

The databases disclosed herein may be any type of structured storage, including persistent memory. In an example, a database may be implemented as a relational database or a file system.

Each of the processors and modules in fig. 2A, 2B, and 7 may be implemented with hardware, software, firmware, or any combination thereof implemented on a computing device. The computing device may include, but is not limited to, a device having a processor and a memory including a tangible, non-transitory memory for executing and storing instructions. The memory may tangibly embody data and program instructions. The software may include one or more applications and an operating system. The hardware may include, but is not limited to, a processor, memory, and a graphical user interface display. The computing device may also have multiple processors and multiple shared or separate memory components. For example, the computing device may be part or all of a clustered or distributed computing environment or server farm.

Identifiers (such as "(a)", "(b)", "(i)", "(ii)" and the like) are sometimes used for different elements or steps. These identifiers are used for clarity and do not necessarily specify the order of the elements or steps.

The invention has been described above with the aid of functional building blocks illustrating embodiments of specific functions and relationships thereof. Boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.

The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

1. A method for processing a domain name system, DNS, name string to obtain a plurality of candidate tokens to detect spoofing, the method comprising:

(a) receiving the DNS name string from a DNS sensor array;

(b) processing the DNS name string to generate a processed DNS name string;

(c) parsing the processed DNS name string based on a separator character to generate a plurality of labels;

(d) determining a total count of labels of the plurality of labels generated from the parsing in (c);

for each integer value between one and the total count of labels determined in (d):

(e) obtaining a subset of labels from the plurality of labels, the subset of labels consisting of a number of labels equal to the integer value, and

(f) concatenating the labels of the subset of labels obtained in (e), according to the order in which the labels appear in the processed DNS name string when read in order, to generate a candidate token, wherein the candidate token is added to the plurality of candidate tokens; and

(g) for each of the plurality of candidate tokens, analyzing the candidate token to determine whether the candidate token matches a subscriber string from a plurality of subscriber strings.

2. The method of claim 1, wherein the processing (b) removes instances of dash characters from the DNS name string to generate the processed DNS name string.

3. The method of claim 1, wherein each of the plurality of labels corresponds to a substring within the processed DNS name string, wherein the substring occurs between successive instances of the separator character within the processed DNS name string.

4. The method of claim 3, wherein the subset of labels obtained in (e) comprises labels occurring consecutively in the processed DNS name string separated by the separator character.

5. The method of claim 3, further comprising, for each integer value between one and the total count of labels determined in (d):

(h) obtaining a second subset of labels from the plurality of labels, the second subset of labels including a number of labels equal to the integer value, wherein the second subset of labels is different from the subset of labels; and

(i) concatenating the labels from the second subset of labels to generate a second candidate token according to an order in which the labels appear in the processed DNS name string when read in order, wherein the second candidate token is added to the plurality of candidate tokens.

6. The method of claim 5, wherein the second subset of labels comprises labels from the plurality of labels that occur consecutively in the processed DNS name string separated by the separator character.

7. The method of claim 1, wherein the separator character is a period character.

8. The method of claim 1, wherein the DNS name string is in American Standard Code for Information Interchange (ASCII) format.

9. The method of claim 1, wherein the DNS name string is in a unicode format.

10. An apparatus comprising a memory device and a processor that processes a domain name system, DNS, name string to obtain a plurality of candidate tokens to detect spoofing, the memory device having stored thereon instructions that, when executed by the processor, cause the processor to:

(a) receiving the DNS name string from a DNS sensor array;

(b) processing the DNS name string to generate a processed DNS name string;

11. The apparatus of claim 10, wherein for process (b), the processor is configured to remove an instance of a dash character from the DNS name string to generate the processed DNS name string.

12. The apparatus of claim 10, wherein each of the plurality of labels corresponds to a substring within the processed DNS name string, wherein the substring occurs between successive instances of the separator character within the processed DNS name string.

13. The apparatus of claim 12, wherein the subset of labels obtained in (e) comprises labels occurring consecutively in the processed DNS name string separated by the separator character.

14. The apparatus of claim 12, wherein the processor is further configured to, for each integer value between one and the total count of labels determined in (d):

15. The apparatus of claim 14, wherein the second subset of labels comprises labels occurring consecutively in the processed DNS name string separated by the separator character.

16. The apparatus of claim 10, wherein the separator character is a period character.

17. The apparatus of claim 10, wherein the DNS name string is in American Standard Code for Information Interchange (ASCII) format.

18. The apparatus of claim 10, wherein the DNS name string is in a unicode format.

19. A non-transitory computer-readable medium having stored thereon instructions that, when executed by at least one computing device, cause the at least one computing device to perform a method for determining when at least one of a plurality of candidate tokens from a domain name system, DNS, name string spoofs at least one of a plurality of subscriber strings, the method comprising:

(a) receiving the DNS name string from a DNS sensor array;

(b) processing the DNS name string to generate a processed DNS name string;

(f) concatenating the labels of the subset of labels obtained in (e), according to the order in which the labels appear in the processed DNS name string when read in order, to generate a candidate token, wherein the candidate token is added to the plurality of candidate tokens; and

(g) for each of the plurality of candidate tokens, analyzing the candidate token to determine whether the candidate token matches a subscriber string from the plurality of subscriber strings.

20. The non-transitory computer-readable medium of claim 19, wherein the processing (b) removes an instance of a dash character from the DNS name string to generate the processed DNS name string.

21. The non-transitory computer-readable medium of claim 19, wherein each of the plurality of labels corresponds to a substring within the processed DNS name string, wherein the substring occurs between consecutive instances of the separator character within the processed DNS name string.

22. The non-transitory computer-readable medium of claim 21, wherein the subset of labels obtained in (e) comprises labels occurring consecutively in the processed DNS name string separated by the separator character.

23. The non-transitory computer-readable medium of claim 21, the method further comprising, for each integer value between one and the total count of labels determined in (d):

24. The non-transitory computer-readable medium of claim 23, wherein the second subset of labels comprises labels occurring consecutively in the processed DNS name string separated by the separator character.

25. The non-transitory computer-readable medium of claim 19, wherein the separator character is a period character.

26. The non-transitory computer-readable medium of claim 19, wherein the DNS name string is in American Standard Code for Information Interchange (ASCII) format.

27. The non-transitory computer-readable medium of claim 19, wherein the DNS name string is in a unicode format.