WO2000026795A1 - Method for content-based filtering of messages by analyzing term characteristics within a message

Method for content-based filtering of messages by analyzing term characteristics within a message

Info

Publication number
WO2000026795A1
Authority
WO
Grant status
Application
Patent type
Prior art keywords
junk
messages
term
message
method
Prior art date
Application number
PCT/US1999/024359
Other languages
French (fr)
Inventor
Mark Kantrowitz
Andrew Mccallum
Evan Bernstein
Original Assignee
Justsystem Pittsburgh Research Center, Inc.
Priority date
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2785 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2705 Parsing
    • G06F17/2715 Statistical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 Arrangements for user-to-user messaging in packet-switching networks, e.g. e-mail or instant messages
    • H04L51/12 Arrangements for user-to-user messaging in packet-switching networks, e.g. e-mail or instant messages with filtering and selective blocking capabilities

Abstract

A computer implemented method for document classification or filtering of junk messages comprises the steps of computing the sum of the product of the frequency of occurrence with an assigned term weight for every term from a term lexicon that also appears in the message, normalizing the resulting sum by dividing the result by the total number of words (or the number of unique words) in the document and assigning a score to the document based on the normalized sum.

Description

METHOD FOR CONTENT-BASED FILTERING OF MESSAGES BY ANALYZING TERM CHARACTERISTICS WITHIN A MESSAGE

BACKGROUND OF THE INVENTION

In this patent, the term "junk messages" is used to refer to both junk e-mail messages and junk newsgroup messages.

Junk messages represent a major and growing problem for the Internet and World Wide Web. Junk messages include many types of messages that the recipient does not wish to read, including messages containing unsolicited commercial advertisements, chain letters, scams and frauds, such as multi-level marketing schemes and get-rich-quick schemes, advertisements for adult services and spam. (Spam is a vernacular term for messages that are posted to an excessive number of newsgroups.)

Junk messages are harmful because they shift the burden of determining importance from sender to recipient, externalizing the true costs of the junk. The sender has no direct incentive to consider the wishes of the recipient. Junk messages waste the recipient's time and money. It takes time to download, identify and discard the junk messages. This buries important messages, causing a loss of productivity. If the recipient pays for connect time and telephone calls, the junk messages cost the recipient money, akin to postage due advertisements. On flat-rate dial up services, the service provider pays for the junk messages in terms of wasted bandwidth and disk space. These costs are ultimately passed on to the recipient. The problem will continue to grow as more people become connected to the Internet.

Most current methods for filtering out junk messages use the headers of the message to identify the junk mail. These programs maintain extensive blacklists of the e-mail addresses, domain names and IP addresses of sources of junk messages and remove any messages from those sources. They may also filter based on other header fields (e.g., peculiarities in the recipient address) or the telltale signs of forged message headers. A comparison of two of the largest blacklists against a large corpus of junk messages found that this method identifies only about 70% of the junk messages.

Another popular method is to filter messages which were transmitted via blind carbon copy or a mailing list. Such messages can be easily identified because the recipient's address does not appear in the recipient fields of the header; but then the recipient must maintain a whitelist of legitimate sources of mail, such as his or her mailing list subscriptions and the e-mail addresses of colleagues who might send a message via blind carbon copy, to avoid filtering out legitimate messages. This heuristic would have caught only about 50% of the junk messages in our corpus.

To summarize, a blacklist is a list of header specifiers used to block messages and a whitelist is a list of header specifiers used to allow messages which would otherwise be filtered out to pass through the blockade.

Unfortunately, blacklists have many problems. They must be constantly updated as the large-scale offenders frequently change domain names and forge return addresses. Many junk messages come from first-time offenders and hence cannot be detected using a blacklist. The offender can also address the messages individually with randomly selected forged return addresses. Header based methods also cannot detect messages transmitted via a mailing list to which the recipient subscribes, nor junk messages posted to newsgroups. The provider of a blacklist faces the possibility of litigation for defamation and restraint of trade, especially if legitimate users and domains are accidentally or intentionally included in the blacklist.

DESCRIPTION OF THE PRIOR ART

W. Tietz, Electronic delivery of unwanted messages in open communications systems, NTZ (Germany), 47(2):74-7, February 1994.

Cynthia Dwork and Moni Naor, Pricing via processing or combating Junk Mail, Weizmann Institute of Science, Department of Applied Mathematics and Computer Science, Technical Report CS95-20, 1995.

Douglas W. Oard and Gary Marchionini, A Conceptual Framework for Text Filtering, University of Maryland at College Park, Technical Report CS-TR-3643, May 1996.

Jason Rennie, ifile mail filtering system, http://www.cs.cmu.edu/~jr6b/ifile/ifile.

U.S. Patent No. 5,619,648 entitled "Message Filtering Techniques", Lucent Technologies Inc., filed November 30, 1994, issued April 8, 1997.

U.S. Patent No. 5,283,856 entitled "Event-Driven Rule-Based Messaging System", Beyond Inc., filed October 4, 1991, issued February 1, 1994. See also related U.S. Patent No. 5,555,346.

U.S. Patent No. 5,627,764 entitled "Automatic Electronic Messaging System With Feedback and Work Flow Administration", Banyan Systems, Inc., filed June 9, 1993, issued May 6, 1997.

U.S. Patent No. 5,377,354 entitled "Method and System for Sorting and Prioritizing Electronic Mail Messages", Digital Equipment Corporation, filed June 8, 1993, issued December 27, 1994.

There are numerous patents dealing with variations on the TFIDF method, including U.S. Patents Nos. 5,576,954; 5,659,766; 5,687,364; 5,371,807; and 5,675,819. TFIDF computes the ratio of the frequency of each term in a document (TF) to the fraction of documents in which the term appears; the inverse of that fraction is the IDF. IDF stands for inverse document frequency.

TFIDF uses IDF to emphasize terms which occur frequently in the document but relatively rarely in the collection of documents. In contrast, TDTF disclosed herein tries to emphasize terms which occur frequently in the message and which are good indicators of junk messages (i.e., terms which occur frequently in junk messages and rarely in non-junk messages). TD ("term discriminability") provides a good indicator of junk messages by measuring the precision of the terms for the specific purpose of classifying junk messages. TDTF computes the product of the frequency of each term in the document (TF) with the term discriminability (TD).

Mail filters in popular mail programs like Eudora have always been able to filter messages based on the presence of specific keywords in the message body. One could, for example, establish a Eudora filter that automatically deletes any message containing the word "sex". In fact, we use this capability for processing the mail that a plugin implementing this invention classifies as junk. The plugin adds a unique keyword to the message to indicate that it is junk, and the user can set up a Eudora filter that redirects the message to a special mailbox, deletes it, or takes some other action on the message. The present invention is more powerful than the simple Boolean keyword search in that it uses an extended vocabulary, with or without term weights, to distinguish junk messages from non-junk messages. With the Eudora filters, it is an all-or-nothing affair. If the keyword is present, the message is classified as junk. If the keyword is not present, the message slips through the filter. The present invention measures the degree to which a message should be classified as junk. There are many words, like "money", which are ambiguous as to whether the message is junk or not. The present invention counts the frequency of occurrence of such terms, along with other common warning signs of junk messages, to provide a quantitative measure of whether a message is junk or not.

Although TFIDF, Naive Bayes, and similar methods have been used for filtering e-mail (see, for example, Jason Rennie's ifile system), they suffer from a sparse data problem. It is very hard for document similarity metrics like TFIDF and Naive Bayes to classify documents when they have very few exemplars of the class. Such metrics need large quantities of data in order to work. We address the sparse data problem by establishing a large, well-formulated query in advance by training on a large corpus of junk messages. Not only does this allow us to accurately identify junk messages without relying on the user to compile and maintain their own corpus of junk messages, but it works immediately, right out of the box. The idea of preparing a well-formulated query for a specific filtering task in advance represents an improvement to the state of the art. It is not possible to do this for the user's own classification system, in general, but for a sharply focused and important problem like eliminating junk messages, it is easy and effective.

SUMMARY OF THE INVENTION

Briefly, according to this invention, there is provided a computer implemented method of filtering of junk messages by analyzing the content of the message instead of or in addition to using the message headers. This method involves document classification using a variety of information retrieval methods, but with unusually large queries. The term "queries", as used herein, refers to searches for terms in messages (or other documents) that match a list of terms (or lexicon). In this invention, a list of terms may include multiple word n-grams. The present invention uses very large queries (on the order of 250, 500 or 1,000 query terms or more in the lexicon) to achieve extremely high accuracy in classifying documents. The key is to pick topics for which a large set of exemplars is available so that the large queries can be constructed.

Besides using the invention to filter junk messages, other possible applications include identifying job announcements, categorizing classified advertisements (e.g., "for sale" versus "wanted", real estate, automobiles and so on), appropriateness for children and other well-defined categories. The present invention may also be used to classify web pages and newsgroup postings in addition to e-mail. Since the categories are static but are of widespread interest, the time invested in constructing large queries will be worthwhile and can be invested by the software manufacturer instead of the end-user.

Junk mail, for example, is filtered by computing the sum of the product of the frequency of occurrence with the term weight for every term from the term lexicon that also appears in the message. The resulting sum is normalized by dividing the result by the total number of words (or the number of unique words) in the document. In other words, it is the dot product of the term frequency vector with the term weight vector, perhaps normalized by document length. The key to the accuracy of this method is a large lexicon. This method permits alternate desired term weighting schemes.
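The scoring rule just described can be sketched in Perl as follows. This is a minimal illustration under stated assumptions, not the inventors' code: the subroutine name `score_document`, the toy lexicon and its weights are hypothetical, and only single-word terms are matched to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical helper: score = sum(frequency * weight) / word count.
# %$lexicon maps a term to its weight; only unigrams are handled here.
sub score_document {
    my ($text, $lexicon) = @_;
    my @words = grep { length } split /\s+/, lc $text;
    return 0 unless @words;

    # Term frequency vector for the document.
    my %freq;
    $freq{$_}++ for @words;

    # Dot product of term frequencies with term weights.
    my $sum = 0;
    for my $term (keys %freq) {
        $sum += $freq{$term} * $lexicon->{$term} if exists $lexicon->{$term};
    }

    # Normalize by document length (total number of words).
    return $sum / @words;
}

# Illustrative weights only (assumed, not taken from a trained lexicon).
my %lexicon = (money => 0.9, free => 0.8, opportunity => 0.7);
my $score = score_document("Free money now free money", \%lexicon);
printf "score = %.3f\n", $score;
```

Setting every weight to 1 reduces this to a plain count of lexicon hits per word, so the same routine could serve any of the weighting schemes by swapping the lexicon hash.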

According to a preferred method, the document or message is broken up into equal size chunks of the same number of words, with the score for the document taken as the maximum score for any chunk in the document. The last, odd-sized chunk may be merged into the previous chunk. Typical chunk sizes may be 50, 100 and 200 words.
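The chunking step can be sketched as below. The helper names `chunk_words` and `max_chunk_score`, the toy scorer and the sample words are assumptions made for illustration; the sketch always merges a short final chunk into the previous one, which is only one of the options the text allows.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: split a word list into fixed-size chunks, merging a
# short final chunk into the previous one.
sub chunk_words {
    my ($words, $size) = @_;
    my @all = @$words;
    my @chunks;
    push @chunks, [ splice(@all, 0, $size) ] while @all;
    if (@chunks > 1 && @{ $chunks[-1] } < $size) {
        my $tail = pop @chunks;
        push @{ $chunks[-1] }, @$tail;
    }
    return @chunks;
}

# Document score = maximum chunk score. $scorer is any code reference that
# maps an array reference of words to a number.
sub max_chunk_score {
    my ($words, $size, $scorer) = @_;
    my @chunks = chunk_words($words, $size);
    return 0 unless @chunks;
    return max(map { $scorer->($_) } @chunks);
}

# Toy scorer (assumed): fraction of words in the chunk equal to "money".
my $scorer = sub {
    my ($chunk) = @_;
    my $hits = grep { $_ eq 'money' } @$chunk;
    return $hits / @$chunk;
};

my @words  = qw(hello there friend money money money bye);
my @chunks = chunk_words(\@words, 3);
my $best   = max_chunk_score(\@words, 3, $scorer);
printf "%d chunks, best score %.2f\n", scalar(@chunks), $best;
```

Taking the maximum rather than the average means a short burst of junk phrasing inside an otherwise ordinary message still raises the document score.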

According to one embodiment, the term weights are uniformly set equal to 1. According to another embodiment, a term's weight is its classification accuracy, as measured in a training corpus. Classification accuracy is the probability that the message is junk given that the term is found in the message, that is, P(Junk | Term). The term weights are adjusted to lie above a minimum term weight (e.g., 0.1%), so that terms which are not present in the training corpus have non-zero term weights. In yet another embodiment, the term weights are the information gain, log(P(Term | Junk)). This embodiment makes use of the Naive Bayes method, but modified to allow the use of word n-grams (bigrams, trigrams, etc.) in addition to word unigrams.
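One way to estimate the classification-accuracy weights from a labeled corpus is sketched below. The subroutine `train_weights`, the four training messages and the 0.001 floor (matching the 0.1% example) are illustrative assumptions; P(Junk | Term) is estimated simply as the fraction of messages containing the term that are junk, and only single words are counted.

```perl
use strict;
use warnings;

# Hypothetical trainer: estimates P(Junk | Term) as the fraction of training
# messages containing the term that are junk, with a floor on the weight.
sub train_weights {
    my ($messages, $lexicon_terms, $min_weight) = @_;
    my (%junk_with, %total_with);

    for my $msg (@$messages) {
        my %seen = map { ($_ => 1) } split ' ', lc $msg->{text};
        for my $term (@$lexicon_terms) {
            next unless $seen{$term};
            $total_with{$term}++;
            $junk_with{$term}++ if $msg->{junk};
        }
    }

    my %weight;
    for my $term (@$lexicon_terms) {
        my $p = $total_with{$term}
              ? ($junk_with{$term} // 0) / $total_with{$term}
              : 0;
        # Terms absent from the training data still get a non-zero weight.
        $weight{$term} = $p > $min_weight ? $p : $min_weight;
    }
    return \%weight;
}

# Made-up training messages, for illustration only.
my @training = (
    { text => 'make money fast',       junk => 1 },
    { text => 'money for lunch today', junk => 0 },
    { text => 'free money offer',      junk => 1 },
    { text => 'meeting notes',         junk => 0 },
);
my $w = train_weights(\@training, [qw(money free lucrative)], 0.001);
printf "%-10s %.3f\n", $_, $w->{$_} for sort keys %$w;
```

Here "lucrative" never occurs in the training data, so it receives the floor value instead of zero and can still contribute to a score.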

A novel method disclosed herein uses word n-gram statistics (including unigram, bigram, trigram and mixed-length n-grams) on message content to identify junk messages. Another novel method disclosed herein involves using a product of term weights with term frequencies.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention uses a content-based method to identify the likelihood of a message being a junk message based on the content of the message itself. The language used in junk messages has characteristics that make it detectable. These methods offer a much higher accuracy than the prior art in correctly classifying messages as either junk or non-junk. The present invention has an accuracy that surpasses the effectiveness of header-based methods and is of sufficient accuracy to be used in stand-alone fashion to filter junk messages. However, there is no reason why it cannot be combined with header-based methods, and it is expected that this combination will be able to stop virtually all junk messages. Because the method is based on the content of the message with a rather fine-grained filter, the junk messages cannot be easily modified to bypass the filter.

The present invention automatically identifies whether a message, such as a piece of e-mail or newsgroup posting, is junk; marks it as junk; and either automatically discards the message or automatically files it in a junk mail folder (directory or subdirectory) for later review and disposition by the user (with the name of the folder designated either by the program or by the user). The present invention includes a user-settable threshold that determines whether a message is classified as junk or not. If the message's bogosity score is above the threshold, it is classified as junk. Otherwise, it is classified as non-junk. The user can set the threshold lower to let no junk through but occasionally misclassify real messages as junk. The user can set the threshold higher to catch most, but not all, of the junk messages while not misclassifying any of the real mail, or the user can set the threshold somewhere between the two. This threshold may be set automatically to the value necessary to maximize the overall accuracy in classifying messages as junk or non-junk. Given a collection of messages classified correctly and a set of misclassified messages, it is a straightforward process to find the threshold value that minimizes the number of classification errors. Since the number of junk messages that slip through increases as the threshold increases, while the number of real messages classified as junk increases as the threshold decreases, there is a threshold value that minimizes the number of classification errors. Common search methods, like hill-climbing and binary search, can be used to find it. This is similar to the methods described herein for adjusting the term weights in the lexicon, but applies to the threshold value instead of the lexicon weights.
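The automatic threshold setting can be sketched as follows. The subroutine `best_threshold`, the scores and the labels are invented for illustration, and a plain exhaustive scan over candidate values is used here in place of hill-climbing or binary search.

```perl
use strict;
use warnings;

# Hypothetical helper: try each candidate threshold and keep the one with
# the fewest classification errors on labeled, pre-scored messages.
# An exhaustive scan is used here in place of hill-climbing.
sub best_threshold {
    my ($scored, $candidates) = @_;
    my ($best, $best_errors);
    for my $t (@$candidates) {
        my $errors = 0;
        for my $m (@$scored) {
            my $predicted_junk = $m->{score} > $t ? 1 : 0;
            $errors++ if $predicted_junk != $m->{junk};
        }
        if (!defined $best_errors || $errors < $best_errors) {
            ($best, $best_errors) = ($t, $errors);
        }
    }
    return ($best, $best_errors);
}

# Made-up scores and labels, for illustration only.
my @scored = (
    { score => 0.05, junk => 0 },
    { score => 0.12, junk => 0 },
    { score => 0.18, junk => 1 },
    { score => 0.22, junk => 0 },
    { score => 0.30, junk => 1 },
    { score => 0.41, junk => 1 },
);
my @candidates = map { $_ / 20 } 1 .. 10;    # 0.05, 0.10, ..., 0.50
my ($threshold, $errors) = best_threshold(\@scored, \@candidates);
printf "threshold %.2f gives %d error(s)\n", $threshold, $errors;
```

When several thresholds tie on the error count, this sketch keeps the lowest one; a user who prefers to protect real mail could instead keep the highest.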

There are many phrases which are quite common in junk messages but significantly less common in legitimate correspondence. Examples include "credit card", "please pardon the intrusion", "make money fast", "extremely lucrative opportunity", "dear adult webmaster", "completely legal", "opportunity of a lifetime", "check or money order", "credit repair", "very lucrative", "limited time offer", and "to be removed". A lexicon of such phrases may be compiled through a combination of automated methods and human judgment.

In one embodiment of this invention, referred to herein as the "bogosity" method, one measures the degree to which the content of the message relies on a restricted lexicon of terms common in junk mail. This yields a "junk density" or bogosity figure. The higher this figure, the greater the degree to which the message uses the telltale signs of junk, and hence the greater the likelihood that the message is junk. Given a junk density threshold, the system can classify as junk any message with a bogosity score above the threshold.

The bogosity method breaks up the messages into, say, 100 word chunks, and counts the number of word n-grams (multiple word phrases) in each chunk which also appear in the lexicon of phrases that are indicative of junk messages. The result is normalized by dividing it by the number of words in the chunk. The default chunk size can be set by the user. Typically, the chunk size will vary between 50 and 200. The bogosity score of the chunk with the highest bogosity score is used as the overall bogosity score of the message. The last chunk in the message may be less than the default chunk size. The bogosity method may ignore this chunk or merge it in with the previous chunk depending on the number of words in the chunk and the number of chunks in the message.
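Counting lexicon n-grams within a single chunk might look like the sketch below. The name `chunk_bogosity`, the four-entry lexicon and the ten-word chunk are assumptions for illustration; phrases of up to five words are checked, mirroring the depth used by the listing later in this description.

```perl
use strict;
use warnings;

# Hypothetical helper: count every word n-gram (n = 1 .. $max_n) in a chunk
# that appears in the lexicon, then divide by the chunk's word count.
sub chunk_bogosity {
    my ($chunk, $lexicon, $max_n) = @_;
    return 0 unless @$chunk;
    my $hits = 0;
    for my $end (0 .. $#$chunk) {
        for my $n (1 .. $max_n) {
            my $start = $end - $n + 1;
            last if $start < 0;
            my $phrase = join ' ', @{$chunk}[$start .. $end];
            $hits++ if $lexicon->{$phrase};
        }
    }
    return $hits / @$chunk;
}

# A tiny illustrative lexicon (assumed); three of its phrases appear among
# the example junk phrases quoted in the text.
my %lexicon = map { ($_ => 1) }
    ('credit card', 'make money fast', 'money', 'limited time offer');

my @chunk = qw(please make money fast with your credit card today friend);
my $bogosity = chunk_bogosity(\@chunk, \%lexicon, 5);
printf "bogosity = %.2f\n", $bogosity;
```

In this example "money", "make money fast" and "credit card" each count once, so overlapping phrases reinforce one another rather than being deduplicated.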

According to another embodiment, referred to herein as the TDTF method, weights are applied to each lexicon entry according to the Term Discriminability (classification accuracy) learned from a training corpus. Lexicon entries that are more indicative of junk will have higher weights than entries which are more ambiguous in nature. Negative weights are also permitted to allow the lexicon to include negative examples (e.g., good indicators of non-junk). This is the TDTF algorithm, where TD stands for term discriminability and TF stands for term frequency.

A variation on the embodiments described uses a library of example junk messages in case-based fashion. The idea is to use the exemplar messages as lexicons and to use an algorithm like bogosity to measure the similarity between the incoming e-mail and each of the messages in the library. If the similarity score for any junk message in the library with the incoming message exceeds a threshold, the incoming message would be classified as junk. This is similar in implementation, although somewhat different in conception, with the difference deriving from the use of the exemplar messages themselves as the lexicons and the use of many smaller lexicons (corresponding to each of the exemplar messages) instead of one large lexicon.
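The case-based variation can be sketched roughly as below. The subroutines `similarity` and `is_junk_by_example`, the two-message library and the 0.5 threshold are all hypothetical; the sketch compares single words only, whereas a practical version would presumably use n-grams and ignore very common words.

```perl
use strict;
use warnings;

# Hypothetical helper: treat the distinct words of one exemplar junk message
# as a small lexicon and measure what fraction of the incoming words it covers.
sub similarity {
    my ($incoming, $exemplar) = @_;
    my %lex   = map { ($_ => 1) } split ' ', lc $exemplar;
    my @words = split ' ', lc $incoming;
    return 0 unless @words;
    my $hits = grep { $lex{$_} } @words;
    return $hits / @words;
}

# Junk if any exemplar in the library scores above the threshold.
sub is_junk_by_example {
    my ($incoming, $library, $threshold) = @_;
    for my $exemplar (@$library) {
        return 1 if similarity($incoming, $exemplar) > $threshold;
    }
    return 0;
}

# Made-up library and messages, for illustration only.
my @library = (
    'make money fast with this limited time offer',
    'dear adult webmaster please pardon the intrusion',
);
my $junk = is_junk_by_example('make money fast today',  \@library, 0.5);
my $ham  = is_junk_by_example('lunch at noon tomorrow', \@library, 0.5);
print "first: $junk, second: $ham\n";
```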

According to yet another embodiment of this invention, use is made of the Naive Bayes statistical method that measures the information gain of classifying the messages using each word from the training corpus and computes the overall likelihood of each message. For example, the top 20 words in the junk class sorted by log likelihood values are: money, report, business, order, orders, mail, e-mail, receive, free, send, credit, bulk, marketing, internet, program, cash, service, people, opportunity and product. This matches our intuitions about what terms are good indicators of junk messages. The benefits of Naive Bayes are that it is a statistically well-founded technique which weights according to likelihood and incorporates notions of positive and negative weights by using separate scores for junk and non-junk and comparing the two.

A problem with Naive Bayes is the assumption that words occur independently. For example, the word "report" may be a good indicator of junk mail (many pyramid schemes use this word), but it also filters out messages about progress reports. This problem is remedied by gathering statistics on word n-grams (e.g., word bigrams and trigrams) in addition to single words.

At a basic level, the bogosity, TDTF, and Naive Bayes methods are similar in implementation. They each maintain a lexicon of terms (single words, word bigrams, word trigrams and word n-grams in general, as well as word n-grams with stop words removed) with weights associated with each term. For bogosity the weight is set equal to 1. For TDTF the weight is the trained classification accuracy (term discriminability) of the term, which is equivalent to the probability that the message is junk given the term, P(Junk | Term). For Naive Bayes, the weight is the information gain, which is the logarithm of the probability of the term, given that the message is junk, log(P(Term | Junk)).

Given these weights, the score for a document (or a chunk of a document) is the dot product (the sum of products, a linear combination of products) of the term frequencies with the corresponding term weights, perhaps normalized by document length.

Various methods have been used on a corpus of junk and non-junk messages, computing the accuracy in classifying junk and non-junk, as well as the overall classification accuracy. It is important not only that the method identify junk, but also that it not mistakenly identify non-junk as junk. Those skilled in the art can quickly write a program for scanning the corpus of junk documents to develop the weights for terms found in the documents.

When the TDTF algorithm's weights are trained using different data than was used to construct the lexicon, some lexicon terms might not appear in the training data. This can happen when human judgment is used to add simple variations to the lexicon terms (e.g., adding a new term that corrects a spelling error in a lexicon term). The new term will not necessarily occur in the training data and so might be assigned a score of 0. It is important to adjust the scores so that this term has a small non-zero value.

As noted previously, the junk accuracy of the heuristic (user not listed as a recipient) was about 50%, and the junk accuracy of blacklists was about 70%. The bogosity embodiment with a 0.20 threshold had a junk classification accuracy of about 90%, a non-junk classification accuracy of about 96% and an overall classification accuracy of about 95%. (Raising the threshold reduces the junk classification accuracy while increasing the non-junk classification accuracy. The 0.25 threshold seemed like a reasonable compromise.) The TDTF method with a threshold of 0.20 had junk, non-junk and overall classification accuracy scores of about 91%, 96% and 95%. Increasing the threshold to 0.25 reduced the junk accuracy to about 81% but increased the non-junk classification accuracy to 98%, with an overall accuracy of about 97%. The method using Naive Bayes with unigrams had a junk classification accuracy of about 97%, a non-junk classification accuracy of about 96% and an overall classification accuracy of about 96%. The method using Naive Bayes with bigrams had a junk classification accuracy of about 98%, a non-junk classification accuracy of about 98% and an overall classification accuracy of about 98%. Thus, the present invention represents a significant improvement to the state of the art.

Alternate implementations would involve several variations on the theme. For example, one implementation would train the lexicon on the user's own e-mail when the user installed the program. Another implementation would provide a ready-made lexicon and weights, and would allow the user to add new terms to the lexicon, delete terms from the lexicon and manually adjust the weights. Yet another implementation would also automatically adjust the weights when presented with new examples of junk and non-junk by small increments (for positive examples) and small decrements (for negative examples) for the terms found in the example. The increments and decrements would be computed using a variety of methods, such as gradient descent.
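The automatic weight adjustment can be sketched as follows. The subroutine `adjust_weights`, the starting weights and the step size are assumptions; a fixed step stands in for an increment computed by a method such as gradient descent.

```perl
use strict;
use warnings;

# Hypothetical update rule: nudge the weight of every lexicon term found in a
# labeled example up (junk) or down (non-junk) by a fixed step. A fixed step
# stands in for a gradient-based increment.
sub adjust_weights {
    my ($weights, $text, $is_junk, $step) = @_;
    my %seen = map { ($_ => 1) } split ' ', lc $text;
    for my $term (keys %seen) {
        next unless exists $weights->{$term};
        $weights->{$term} += $is_junk ? $step : -$step;
    }
    return $weights;
}

# Illustrative starting weights (assumed).
my %weights = (money => 0.50, report => 0.50, free => 0.50);
adjust_weights(\%weights, 'free money inside',         1, 0.05);
adjust_weights(\%weights, 'quarterly progress report', 0, 0.05);
printf "%-7s %.2f\n", $_, $weights{$_} for sort keys %weights;
```

Each term is adjusted at most once per example, regardless of how often it occurs, so a single repetitive message cannot dominate the lexicon.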

Prototypes of each of these methods have been implemented in Perl and C. The method has been found to be quite useful in practice with Unix mail. It has also been implemented as a plugin for the popular Windows and Macintosh mail program Eudora. The latest version also includes adjustable thresholds, whitelists and blacklists, and can highlight significant keywords in the e-mail message.

A copy of the Perl source code for a stand-alone version of bogosity and part of its lexicon follow. For an explanation of the Perl language, reference is made to Learning Perl, Second Edition, by Randal L. Schwartz and Tom Christiansen (O'Reilly & Associates, Inc. 1997).

SOURCE CODE FOR BOGOSITY.PL

# Directory holding the mail file and the phrase lexicon.
$rootdir = "C:\\usr\\mkant\\Bogosity\\";
$mailfile = $ARGV[0];
$mailfile = "mail.txt" if (!$mailfile);
# the file of bogus words and phrases
$phrasefile = "bogosity.txt";
# number of words per chunk
$chunksize = $ARGV[1];
$chunksize = 200 if (!$chunksize);
# Let -ly and -est contribute to bogosity
$lyest = 1;
# For counting ! and ?
$maxrictus = 0;
$rictus = 0;

# Load the phrase file.
open(PHRASE, "$rootdir$phrasefile") or die "Cannot open $rootdir$phrasefile: $!";
foreach $phrase (<PHRASE>) {
    chomp $phrase;
    $phrases{"$phrase"} = 1;
}
close(PHRASE);

# process the mail
$maxbogosity = 0;
$wordcount = 0;
$bogosity = 0;
$prev = " ";
$pprev = " ";
$ppprev = " ";
$pppprev = " ";
open(MAIL, "$rootdir$mailfile") or die "Cannot open $rootdir$mailfile: $!";
foreach $line (<MAIL>) {
    chomp $line;
    # NOTE: the statements that split each line into words and derive $lword
    # are missing from the published listing; the next two lines are an
    # assumed reconstruction.
    foreach $word (split(/\s+/, $line)) {
        $lword = lc($word);
        # Skip URLs, HTML attributes, bare numbers/punctuation and empty tokens.
        if ($word !~ /<URL:/i &&
            $word !~ /http:\/\//i &&
            $word !~ /ftp:\/\//i &&
            $word !~ /name\s*=/i &&
            $word !~ /href\s*=/i &&
            $word !~ /gopher:\/\//i &&
            $word !~ /^ / &&
            $lword !~ /^[\d\-(),\.\$]*$/ &&
            $lword ne "") {
            $wordcount++;
            # Single-word match, or an -ly / -est word.
            if ($phrases{"$lword"} == 1) {
                $bogosity++;
            } elsif ($lyest == 1 && ($lword =~ /ly$|est$/)) {
                $bogosity++;
            }
            # Two- to five-word phrases ending at the current word.
            if ($phrases{"$prev $lword"} == 1) {
                $bogosity++;
            }
            if ($phrases{"$pprev $prev $lword"} == 1) {
                $bogosity++;
            }
            if ($phrases{"$ppprev $pprev $prev $lword"} == 1) {
                $bogosity++;
            }
            if ($phrases{"$pppprev $ppprev $pprev $prev $lword"} == 1) {
                $bogosity++;
            }
            if ($word =~ /\?$|\!$/) {
                $rictus++;
            }
            # End of chunk: update the running maxima and reset the counters.
            if ($wordcount >= $chunksize) {
                if ($bogosity > $maxbogosity) {
                    $maxbogosity = $bogosity;
                }
                if ($rictus > $maxrictus) {
                    $maxrictus = $rictus;
                }
                $wordcount = 0;
                $bogosity = 0;
                $rictus = 0;
            }
            # Slide the window of previous words.
            $pppprev = $ppprev;
            $ppprev = $pprev;
            $pprev = $prev;
            $prev = $lword;
        }
    }
}
close(MAIL);
# A trailing partial chunk is discarded, as in the published listing.
printf "Maximum Bogosity: %.3f ($maxbogosity/$chunksize)\n", $maxbogosity/$chunksize;
printf "Maximum Chunk Rictus (!?): %.3f ($maxrictus/$chunksize)\n", $maxrictus/$chunksize;

BOGOSITY.TXT (partial)

I I

$

$$

$$$

$$$$

$$$$$

$$$$$$

$$$$$$$

$2.7 billion

$50,000

$50,000 dollars or more

$6.6 billion

$70,000

*this* mailing list

1,000

1,000,000

10,000

100%

100% committed

100% legal

100% of the time

100% satisfied

100,000

1000%

1302

1342

18 years old

1st level

1st time

200%

2nd level

3 level

300%

3rd level

4 level

400%

4th level

500%

5th level

8 level

90-day limited warranty

Four-level a brand new social security number a copy of a couple of a credit card a deep breath a different report a few a few hours a few minutes a large amount of money a leading a letter a limited number a list a little bit a little time a lot a lot easier a lot more a lot of a lot of money a lot of time a mail box a mailbox a mailing list a mailing list company a mailing of a miracle a month a must a sign a significant advantage a sound way a special program a testimonial a ton of money a top leader a total of a total of perhaps a variety of ability about to make absolutely absolutely convinced absolutely free absolutely guarantee absolutely guaranteed absolutely no credit check absolutely no other fees absolutely no risk absolutely nothing abuse accept all credit cards accept all major credit cards accept american express accept amex accept cash accept check accept checks accept credit cards accept creditcards accept major credit cards accept master accept mastercard accept money orders accept payment accept personal checks accept visa accept visa/master access fees account executive account number account representative acquiring e-mail lists acquiring email lists act fast act now action activity level ad ad banner ad below ad campaign ad length system added bonus additional income address address city addressed addresses addresses accurately

The program flow can generally be described as follows. The lexicon file containing the words and phrases characteristic of junk mail, "bogosity.txt", and the file containing the mail, "mail.txt", are opened. A word is input from the mail.txt file and compared to the lexicon. If a match is found the score for that word (in this case always the same) is added to the raw score. The first word is kept so that it along with the next word can be compared to double-word phrases in the lexicon. Words and phrases (in this case up to five-word phrases) are compared to the lexicon and scored. When the maximum chunk size has been read and compared to the lexicon, the total score is divided by the chunk size. The next chunk is then analyzed. A running maximum score for the chunks of the message is kept and used as the score for the message. If the last chunk is too short, it is merged with the next-to-last chunk or discarded. Finally, a line of text is added to the message to tag it as junk or not. Most mail programs have the capability of filing or discarding messages based upon this added line of text. This program is easily modified to implement the TDTF method and the Naive Bayes methods. The only difference is the use of different weights for terms in the lexicon.

Having thus defined our invention in the detail and particularity required by the Patent Laws, what is desired protected by Letters Patent is set forth in the following claims.

Claims

WE CLAIM:
1. A computer implemented method for filtering of junk messages comprising analyzing the content of the messages.
2. A computer implemented method for classification of a document as a junk message comprising analyzing the content of documents for the presence or absence of more than 250 words and/or multiple word n-grams.
3. A computer implemented method for classification of a document as a junk message comprising the steps of: a) computing the sum of the product of the frequency of occurrence with an assigned term weight for every term and/or multiple word n-grams from a term lexicon that also appears in a document; and b) assigning a score to the document based on the resulting sum.
4. The method according to claim 3, comprising the step of normalizing the resulting sum by dividing the result by the total number of words (or the number of unique words) in the document.
5. The method according to claim 3, wherein the document is broken up into equal sized chunks of the same number of words, with the score for the document as the maximum score for any chunk in the message.
6. The method according to claim 3, 4 or 5, comprising the further step of comparing the score assigned to the document to an adjustable threshold and classifying the document on the basis of that comparison.
7. The method according to claim 3, 4 or 5, wherein the term weights are uniformly set equal to 1.
8. The method according to claim 3, 4 or 5, wherein a term's weight is its classification accuracy P(Junk | Term), as measured in a training corpus.
9. The method according to claim 3, 4 or 5, wherein the term weights are the information gain, log(P(Term | Junk)), as measured in a training corpus.
10. The method according to claim 3, 4 or 5, wherein the term weights are supplied by the dependency tree algorithm.
11. The method according to claim 3, 4 or 5, with any monotonic modification of the weights.
12. The method according to claim 3, 4 or 5, wherein the lexicon is comprised of a plurality of lexicons and a score is assigned to the document based upon maximum score using any one of the plurality of lexicons.
13. The method according to claim 12, wherein the plurality of lexicons includes one or more junk messages.
14. The method according to claim 3, 4 or 5 applied to e-mail documents or the like, wherein message headers are compared with a blacklist to block messages that match header-based constraints.
15. The method according to claim 3, 4 or 5 applied to e-mail documents or the like, wherein message headers are compared with a whitelist to pass through messages that match header-based constraints.
16. The method according to claim 14, wherein only documents that are not blocked by the blacklist constraint are classified.
17. The method according to claim 15, wherein only documents that pass the whitelist constraint are classified.
18. The method according to claim 6 applied to e-mail documents or the like, wherein the user can set the threshold to let no junk mail through but occasionally misclassify a non-junk message as junk, or the user can set the threshold to block most, but not all, junk messages while not misclassifying any non-junk messages, or the threshold can be set somewhere therebetween.
19. The method according to claim 1, comprising a step for assigning a score to the document based on the content thereof which rises and falls with the likelihood that the message is junk and a step for comparing the score to a threshold to determine whether the message should be classified as junk.
20. The method according to claim 19, comprising the step for adjusting the threshold to control the balance between identifying junk messages and misclassifying non-junk messages.
21. The method according to claim 19, comprising a step for automatically setting the threshold to minimize all classification errors.
22. The method according to claim 22, wherein the lexicon is derived from a training set of documents.
23. The method according to claim 22, wherein every term and/or multi-word n-gram in the lexicon has at least a minimum value.
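
By way of illustration only, the training-derived weights of claim 8 and the automatically set threshold of claim 21 might be sketched as follows in Python. The claims do not fix an estimator or a search procedure, so everything here is an assumption: P(Junk | Term) is estimated as the fraction of training documents containing a term that are junk, only single words are handled (no multi-word n-grams), and the threshold is chosen from the observed scores by exhaustive search. The function names are hypothetical.

```python
from collections import Counter


def term_weights(junk_docs, ok_docs, min_count=1):
    """Estimate P(Junk | Term) from document frequencies in a training corpus.

    Assumption: a term's weight is (junk documents containing the term)
    divided by (all training documents containing the term).
    """
    junk_df = Counter(t for d in junk_docs for t in set(d.lower().split()))
    ok_df = Counter(t for d in ok_docs for t in set(d.lower().split()))
    weights = {}
    for term, junk_count in junk_df.items():
        total = junk_count + ok_df[term]  # Counter returns 0 for unseen terms
        if total >= min_count:
            weights[term] = junk_count / total
    return weights


def best_threshold(scores, labels):
    """Pick the threshold that minimizes total classification errors.

    A message is called junk when score >= threshold. `labels` holds True
    for junk and False for non-junk. Candidates are the observed scores
    plus one value above the maximum (which classifies nothing as junk).
    Assumes `scores` is non-empty.
    """
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    def errors(threshold):
        return sum((s >= threshold) != is_junk for s, is_junk in zip(scores, labels))

    return min(candidates, key=errors)
```

Raising the chosen threshold trades missed junk for fewer misclassified non-junk messages, and lowering it does the reverse, which is the balance described in claims 18 and 20.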
PCT/US1999/024359 1998-10-30 1999-10-18 Method for content-based filtering of messages by analyzing term characteristics within a message WO2000026795A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US18387198 1998-10-30 1998-10-30
US09/183,871 1998-10-30

Publications (1)

Publication Number Publication Date
WO2000026795A1 (en) 2000-05-11

Family

ID=22674651

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/024359 WO2000026795A1 (en) 1998-10-30 1999-10-18 Method for content-based filtering of messages by analyzing term characteristics within a message

Country Status (1)

Country Link
WO (1) WO2000026795A1 (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5493692A (en) * 1993-12-03 1996-02-20 Xerox Corporation Selective delivery of electronic messages in a multiple computer system based on context and environment of a user
US5619648A (en) * 1994-11-30 1997-04-08 Lucent Technologies Inc. Message filtering techniques
US5659766A (en) * 1994-09-16 1997-08-19 Xerox Corporation Method and apparatus for inferring the topical content of a document based upon its lexical content without supervision
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US5687364A (en) * 1994-09-16 1997-11-11 Xerox Corporation Method for learning to infer the topical content of documents based upon their lexical content
US5742769A (en) * 1996-05-06 1998-04-21 Banyan Systems, Inc. Directory with options for access to and display of email addresses
US5790935A (en) * 1996-01-30 1998-08-04 Hughes Aircraft Company Virtual on-demand digital information delivery system and method
US5826022A (en) * 1996-04-05 1998-10-20 Sun Microsystems, Inc. Method and apparatus for receiving electronic mail
US5832212A (en) * 1996-04-19 1998-11-03 International Business Machines Corporation Censoring browser method and apparatus for internet viewing
US5905863A (en) * 1996-06-07 1999-05-18 At&T Corp Finding an e-mail message to which another e-mail message is a response
US5963965A (en) * 1997-02-18 1999-10-05 Semio Corporation Text processing and retrieval system and method
US5999932A (en) * 1998-01-13 1999-12-07 Bright Light Technologies, Inc. System and method for filtering unsolicited electronic mail messages using data matching and heuristic processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MARCHIONINI GARY: "A conceptual framework for text filtering", May 1996 (1996-05-01), pages 1 - 32, XP002923254 *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7991720B2 (en) 1992-04-30 2011-08-02 Apple Inc. Method and apparatus for organizing information in a computer system
US8661066B2 (en) 1998-07-21 2014-02-25 West Service, Inc. Systems, methods, and software for presenting legal case histories
US7529756B1 (en) 1998-07-21 2009-05-05 West Services, Inc. System and method for processing formatted text documents in a database
US8250118B2 (en) 1998-07-21 2012-08-21 West Services, Inc. Systems, methods, and software for presenting legal case histories
US8600974B2 (en) 1998-07-21 2013-12-03 West Services Inc. System and method for processing formatted text documents in a database
US7778954B2 (en) 1998-07-21 2010-08-17 West Publishing Corporation Systems, methods, and software for presenting legal case histories
US6463430B1 (en) 2000-07-10 2002-10-08 Mohomine, Inc. Devices and methods for generating and managing a database
WO2002006997A3 (en) * 2000-07-17 2003-08-14 Qualcomm Inc Method of and system for screening electronic mail items
WO2002006997A2 (en) * 2000-07-17 2002-01-24 Qualcomm Incorporated Method of and system for screening electronic mail items
WO2002013055A2 (en) * 2000-08-09 2002-02-14 Elron Software, Inc. Automatic categorization of documents based on textual content
US6621930B1 (en) 2000-08-09 2003-09-16 Elron Software, Inc. Automatic categorization of documents based on textual content
WO2002013055A3 (en) * 2000-08-09 2003-09-18 Elron Software Inc Automatic categorization of documents based on textual content
GB2366706A (en) * 2000-08-31 2002-03-13 Content Technologies Ltd Monitoring email eg for spam,junk etc
GB2366706B (en) * 2000-08-31 2004-11-03 Content Technologies Ltd Monitoring electronic mail messages digests
US7801960B2 (en) 2000-08-31 2010-09-21 Clearswift Limited Monitoring electronic mail message digests
WO2002056197A1 (en) * 2001-01-10 2002-07-18 Kluwer Academic Publishers B.V. System and method for electronic document handling
US7856479B2 (en) 2001-06-14 2010-12-21 Apple Inc. Method and apparatus for filtering email
US7836135B2 (en) 2001-06-14 2010-11-16 Apple Inc. Method and apparatus for filtering email
US7076527B2 (en) 2001-06-14 2006-07-11 Apple Computer, Inc. Method and apparatus for filtering email
WO2002103604A1 (en) * 2001-06-14 2002-12-27 Apple Computer, Inc. Method and apparatus for filtering email
WO2003040875A3 (en) * 2001-11-02 2003-08-07 West Publishing Company Doing Systems, methods, and software for classifying documents
US7580939B2 (en) 2001-11-02 2009-08-25 Thomson Reuters Global Resources Systems, methods, and software for classifying text from judicial opinions and other documents
WO2003040875A2 (en) * 2001-11-02 2003-05-15 West Publishing Company Doing Business As West Group Systems, methods, and software for classifying documents
EP2012240A1 (en) * 2001-11-02 2009-01-07 Thomson Reuters Global Resources Systems, methods, and software for classifying documents
US6732157B1 (en) 2002-12-13 2004-05-04 Networks Associates Technology, Inc. Comprehensive anti-spam system, method, and computer program product for filtering unwanted e-mail messages
WO2004070627A1 (en) * 2003-02-10 2004-08-19 British Telecommunications Public Limited Company Determining a level of expertise of a text using classification and application to information retrival
WO2005043416A3 (en) * 2003-11-03 2005-07-21 Cloudmark Inc Methods and apparatuses for determining and designating classifications of electronic documents
WO2005043416A2 (en) * 2003-11-03 2005-05-12 Cloudmark, Inc. Methods and apparatuses for determining and designating classifications of electronic documents
JP2006178998A (en) * 2004-12-21 2006-07-06 Lucent Technol Inc Detection of annoying message (spam) based on message content
EP1675330A1 (en) * 2004-12-21 2006-06-28 Lucent Technologies Inc. Unwanted message (SPAM) detection based on message content
KR101170562B1 (en) 2004-12-21 2012-08-01 알카텔-루센트 유에스에이 인코포레이티드 Unwanted messagespam detection based on message content
CN100563335C (en) 2007-04-19 2009-11-25 北京新岸线网络技术有限公司 Classified content auditing terminal system
US8713027B2 (en) 2009-11-18 2014-04-29 Qualcomm Incorporated Methods and systems for managing electronic messages
US9477749B2 (en) 2012-03-02 2016-10-25 Clarabridge, Inc. Apparatus for identifying root cause using unstructured data
US20150169511A1 (en) * 2012-06-25 2015-06-18 Beijing Qihoo Technology Company Limited System and method for identifying floor of main body of webpage
CN103092975A (en) * 2013-01-25 2013-05-08 武汉大学 Detection and filter method of network community garbage information based on topic consensus coverage rate
CN104392362A (en) * 2014-11-06 2015-03-04 中国建设银行股份有限公司 Information processing method and device
CN104392362B (en) * 2014-11-06 2018-03-23 中国建设银行股份有限公司 Information processing method and apparatus

Legal Events

Date Code Title Description
ENP Entry into the national phase in:

Ref country code: AU

Ref document number: 2000 11221

Kind code of ref document: A

Format of ref document f/p: F

AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ CZ DE DE DK DK DM EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase