CN1790405A

CN1790405A - Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email

Info

Publication number: CN1790405A
Application number: CNA2005101356033A
Authority: CN
Inventors: 钱德沛
Original assignee: Individual
Current assignee: Individual
Priority date: 2005-12-31
Filing date: 2005-12-31
Publication date: 2006-06-21

Abstract

The invention discloses a new feature selection parameter, the likelihood ratio logarithm, designed specifically for spam classification. It combines the influence on classification of a low-frequency word both appearing in and being absent from a mail, lets the Bayesian classification process reuse the results computed during feature selection, and improves the recall and overall classification performance of the result.

Description

Chinese spam content classification and authentication algorithm based on Bayes

Technical field

The present invention is an algorithm, built on Bayesian classification, for filtering Chinese spam by content. More specifically, it applies the Bayes principle and proposes a new feature selection parameter, the likelihood ratio logarithm, with which Chinese spam is filtered.

Background technology

Spam filtering techniques fall mainly into mail header filtering and mail content filtering. Header filtering considers only the header fields of a mail and can block a message before it has been fully submitted; content filtering mainly analyzes the text content of the mail.

In terms of content, spam filtering can be regarded as a binary classification problem: mail is divided into two classes, spam and legitimate mail. Various text classification methods can therefore be applied to spam filtering, such as the rule-based Ripper algorithm, the C4.5 decision tree algorithm, Boosting, rough sets, and statistics-based methods such as support vector machines, the kNN algorithm, and Bayesian classification. A large body of research indicates that the Bayesian method achieves better accuracy than the other classification methods. It is also adaptive: the feature database it uses can be updated as the features of spam change.

The key to the Bayesian algorithm is feature selection. The accuracy of feature selection determines how accurately the vector represents a mail, and therefore how accurately the mail is classified.

Existing Bayesian implementations all use feature selection based on document frequency (Document Frequency) and feature selection based on mutual information (Mutual Information). With these methods, the contribution of low-frequency words to classification becomes the key factor affecting the accuracy of feature selection.

Summary of the invention

The present invention proposes a new feature selection parameter, the likelihood ratio logarithm, to meet the specific requirements of spam filtering.

For an entry w and a class variable c, in the vector x, x_i = 0 indicates that the i-th entry of the vector space x does not occur in class c, and x_i = 1 indicates that the i-th entry occurs in class c. Let N be the total number of sample mails, NL the number of legitimate mails, and NS the number of spam mails. NS_{x_i} is the number of spam mails with X_i = x_i, and NL_{x_i} is the number of legitimate mails with X_i = x_i.

When x_i = 1, NS_{x_i} and NL_{x_i} are the numbers of spam mails and legitimate mails, respectively, that contain the i-th entry, and the ratio

\frac{NS_{x_i} / NS}{NL_{x_i} / NL}

is called the likelihood of x_i = 1, denoted likelihood1(w). When x_i = 0, NS_{x_i} and NL_{x_i} are the numbers of spam mails and legitimate mails, respectively, that do not contain the i-th entry, and the same ratio is called the likelihood of x_i = 0, denoted likelihood0(w). The likelihood ratio logarithm LL(w) of the entry w in the feature database is

LL(w) = | \log \frac{likelihood0(w)}{likelihood1(w)} | .

The likelihood ratio logarithm takes into account the influence of an entry on classification both when it appears and when it does not. The larger this value, the more the ratio of spam to legitimate mail differs between the two cases, and hence the greater the influence of the entry on classification.
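As an illustration of the definition above, the following minimal Python sketch computes likelihood0, likelihood1 and LL for a single entry from four counts. The function and parameter names are hypothetical, and the check that rejects entries occurring in none or all of the mails of a class is an assumption of this sketch; the text does not say how such cases are handled.

```python
import math


def likelihoods(n_winspam, n_winlegit, spam_num, legit_num):
    """Return (likelihood0, likelihood1) for one entry."""
    p_spam = n_winspam / spam_num      # fraction of spam mails containing the entry
    p_legit = n_winlegit / legit_num   # fraction of legitimate mails containing it
    # Assumption of this sketch: both fractions lie strictly between 0 and 1.
    # Otherwise a ratio or the logarithm is undefined and smoothing is needed.
    if not (0 < p_spam < 1 and 0 < p_legit < 1):
        raise ValueError("entry must occur in some but not all mails of each class")
    likelihood1 = p_spam / p_legit
    likelihood0 = (1 - p_spam) / (1 - p_legit)
    return likelihood0, likelihood1


def likelihood_ratio_log(n_winspam, n_winlegit, spam_num, legit_num):
    """Return LL = |log(likelihood0 / likelihood1)| for one entry."""
    likelihood0, likelihood1 = likelihoods(n_winspam, n_winlegit, spam_num, legit_num)
    return abs(math.log(likelihood0 / likelihood1))


# Entry found in 80 of 100 spam mails and 10 of 100 legitimate mails.
strong = likelihood_ratio_log(80, 10, 100, 100)
# Entry found in half of the mails of each class.
neutral = likelihood_ratio_log(50, 50, 100, 100)
print(strong, neutral)
```

For the first entry likelihood1 is 8 and likelihood0 is 2/9, so LL is ln 36, about 3.58. For the second entry both likelihoods are 1 and LL is 0, which matches the reading that an entry equally common in both classes has no influence on classification.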

The specific implementation algorithm

1. Training algorithm

Initialize the vocabulary V
Vectorize each classified mail in the training set M
SpamNum = total number of spam mails in the training set
LegiNum = total number of legitimate mails in the training set
FOR EACH word ∈ V
DO
    n_winspam = 0     // number of spam mails in which word appears
    n_winlegit = 0    // number of legitimate mails in which word appears
    FOR EACH mail ∈ M
    DO
        IF word ∈ mail THEN
            IF mail INSTANCEOF Spam THEN
                n_winspam = n_winspam + 1
            ELSE
                n_winlegit = n_winlegit + 1
    DONE
    Likelihood0 = \frac{1 - n_{winspam} / SpamNum}{1 - n_{winlegit} / LegiNum}
    Likelihood1 = \frac{n_{winspam} / SpamNum}{n_{winlegit} / LegiNum}
    LL = | \log \frac{Likelihood0}{Likelihood1} |
    Save n_winspam, n_winlegit, Likelihood0, Likelihood1, LL
DONE
Sort V by LL and select the n words with the largest LL
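The training procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each mail is assumed to be already segmented into a set of words (Chinese word segmentation is out of scope here), the function name `train` and its parameters are hypothetical, and the additive smoothing constant `alpha` is an addition of this sketch to avoid division by zero for words that occur in none or all of the mails of a class.

```python
import math


def train(mails, top_n, alpha=1.0):
    """Score every word by LL and keep the top_n words.

    mails: list of (set_of_words, is_spam) pairs, already segmented.
    alpha: additive smoothing constant (an assumption, not in the source).
    Returns a dict: word -> (likelihood0, likelihood1, ll).
    """
    spam_num = sum(1 for _, is_spam in mails if is_spam)
    legit_num = len(mails) - spam_num
    vocabulary = set().union(*(words for words, _ in mails))

    scored = {}
    for word in vocabulary:
        # Count the mails of each class that contain the word.
        n_winspam = sum(1 for words, is_spam in mails if is_spam and word in words)
        n_winlegit = sum(1 for words, is_spam in mails if not is_spam and word in words)
        # Smoothed fractions; with alpha = 0 these are the plain ratios.
        p_spam = (n_winspam + alpha) / (spam_num + 2 * alpha)
        p_legit = (n_winlegit + alpha) / (legit_num + 2 * alpha)
        likelihood1 = p_spam / p_legit
        likelihood0 = (1 - p_spam) / (1 - p_legit)
        ll = abs(math.log(likelihood0 / likelihood1))
        scored[word] = (likelihood0, likelihood1, ll)

    # Sort the vocabulary by LL and keep the n words with the largest LL.
    best = sorted(scored, key=lambda w: scored[w][2], reverse=True)[:top_n]
    return {w: scored[w] for w in best}


# Toy training set with English placeholder words.
mails = [
    ({"free", "win"}, True),
    ({"free", "money"}, True),
    ({"meeting", "free"}, False),
    ({"meeting", "report"}, False),
]
features = train(mails, top_n=1)
print(features)
```

In this toy set "meeting" appears in both legitimate mails and in no spam mail, so it receives the largest LL and is the single feature selected.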

2. Classification algorithm

DEFINE λ_THRE
FOR EACH mail TO CLASSIFY DO
    λ = 1
    FOR EACH word ∈ V
    DO
        IF word ∈ mail
            λ = λ * word[Likelihood1]
        ELSE
            λ = λ * word[Likelihood0]
    DONE
    IF λ ≥ λ_THRE THEN
        CLASSIFY mail AS SPAM
    ELSE
        CLASSIFY mail AS LEGITIMATE MAIL
DONE
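A minimal Python sketch of the classification step follows. It assumes that a mail is labeled spam when λ is at least the threshold λ_THRE, and it sums logarithms instead of multiplying the likelihoods directly, which gives the same decision while avoiding numeric underflow for large vocabularies. The function name `classify`, the `features` layout and the example values are hypothetical.

```python
import math


def classify(mail_words, features, threshold=1.0):
    """Label one mail as "spam" or "legitimate".

    mail_words: set of words in the mail, already segmented.
    features: dict word -> (likelihood0, likelihood1) for the selected words.
    threshold: the decision threshold lambda_THRE (must be positive).
    """
    log_lambda = 0.0  # log of lambda = 1
    for word, (likelihood0, likelihood1) in features.items():
        # Use likelihood1 if the word occurs in the mail, likelihood0 otherwise.
        factor = likelihood1 if word in mail_words else likelihood0
        log_lambda += math.log(factor)
    # Assumption: spam when lambda >= lambda_THRE, compared in the log domain.
    return "spam" if log_lambda >= math.log(threshold) else "legitimate"


# Illustrative feature values, not taken from the source.
features = {"free": (0.5, 1.5), "meeting": (3.0, 1.0 / 3.0)}

label_a = classify({"free", "offer"}, features)       # lambda = 1.5 * 3.0
label_b = classify({"meeting", "agenda"}, features)   # lambda = 0.5 * (1/3)
print(label_a, label_b)
```

Raising the threshold makes the filter more conservative about labeling mail as spam, and lowering it does the opposite; the choice of λ_THRE is left open here, as it is in the text.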

3. Learning algorithm

FOR EACH mail TO LEARN DO
    CASE mail OF
        SPAM:
            SpamNum = SpamNum + 1
        LEGITIMATE MAIL:
            LegiNum = LegiNum + 1
    FOR EACH word ∈ mail DO
        CASE mail OF
            SPAM:
                n_winspam = n_winspam + 1
            LEGITIMATE MAIL:
                n_winlegit = n_winlegit + 1
        Likelihood0 = \frac{1 - n_{winspam} / SpamNum}{1 - n_{winlegit} / LegiNum}
        Likelihood1 = \frac{n_{winspam} / SpamNum}{n_{winlegit} / LegiNum}
        LL = | \log \frac{Likelihood0}{Likelihood1} |
    DONE
DONE
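The incremental learning step can be sketched in Python as below. This is one possible reading, not the patented implementation: the class total is increased once per learned mail, per-word counts are increased once per mail containing the word, and the likelihoods are recomputed on demand from the current counts. The class name `IncrementalFilter`, its methods and the smoothing constant `alpha` are hypothetical additions of this sketch.

```python
import math
from collections import Counter


class IncrementalFilter:
    """Keeps the counts needed to recompute Likelihood0, Likelihood1 and LL."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha           # additive smoothing (assumption, not in source)
        self.spam_num = 0            # number of spam mails learned so far
        self.legit_num = 0           # number of legitimate mails learned so far
        self.n_winspam = Counter()   # word -> number of spam mails containing it
        self.n_winlegit = Counter()  # word -> number of legitimate mails containing it

    def learn(self, mail_words, is_spam):
        """Update the counts with one newly labeled mail."""
        words = set(mail_words)      # count each word once per mail
        if is_spam:
            self.spam_num += 1
            self.n_winspam.update(words)
        else:
            self.legit_num += 1
            self.n_winlegit.update(words)

    def stats(self, word):
        """Return (likelihood0, likelihood1, ll) from the current counts."""
        p_spam = (self.n_winspam[word] + self.alpha) / (self.spam_num + 2 * self.alpha)
        p_legit = (self.n_winlegit[word] + self.alpha) / (self.legit_num + 2 * self.alpha)
        likelihood1 = p_spam / p_legit
        likelihood0 = (1 - p_spam) / (1 - p_legit)
        return likelihood0, likelihood1, abs(math.log(likelihood0 / likelihood1))


flt = IncrementalFilter()
flt.learn({"free", "win"}, True)
flt.learn({"free", "money"}, True)
flt.learn({"meeting", "free"}, False)
flt.learn({"meeting", "report"}, False)
ll_before = flt.stats("meeting")[2]

# A spam mail that mentions "meeting" weakens the word as a feature.
flt.learn({"meeting", "prize"}, True)
ll_after = flt.stats("meeting")[2]
print(ll_before, ll_after)
```

Computing the statistics on demand means that a change in the class totals is reflected for every word, not only for the words of the mail just learned. Storing the likelihoods per word and refreshing them after each mail would be an equally valid design.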

Description of drawings

Fig. 1 is a class diagram of the Bayesian algorithm implementation;

Fig. 2 is a class diagram of the filter implementation based on the Bayesian algorithm.

Claims

1. A Bayes-based content classification and authentication algorithm for Chinese spam, in which the Bayes algorithm principle is applied to Chinese spam filtering and, on that basis, a new feature selection parameter, the likelihood ratio logarithm, is proposed; the parameter is designed specifically for spam classification, combines the influence on classification of a low-frequency word both occurring in and being absent from a mail, and allows the Bayesian classification process to use the results computed during feature selection.