Bayes-based algorithm for classifying and identifying Chinese spam content
Technical field
The present invention provides a Bayes-based algorithm for filtering Chinese spam. More specifically, the invention applies the Bayes principle and proposes a new feature selection parameter, the likelihood ratio logarithm, which is used to filter Chinese spam.
Background technology
Spam filtering technology mainly comprises mail header filtering and mail content filtering. Mail header filtering considers only the header fields of a mail, so the mail can be blocked before it has been fully submitted; mail content filtering mainly analyzes the text content of the mail.
In terms of content, spam filtering can be regarded as a binary classification problem: each mail is assigned to one of two classes, spam or normal mail. Various text classification methods can therefore be used for spam filtering, such as the rule-based Ripper algorithm, the C4.5 decision tree algorithm, the Boosting method, and the Rough Set method, as well as statistics-based methods such as the support vector machine, the kNN algorithm, and Bayesian classification. Many studies have shown that the Bayesian method achieves better accuracy than the other classification methods. It is also adaptive: it can learn and update its feature database as the characteristics of spam change.
The key to the Bayesian algorithm is feature selection. The accuracy of feature selection determines how accurately the vector represents a mail, and therefore how accurately the mail is classified.
Existing Bayesian implementations have all adopted feature selection methods based on document frequency (Document Frequency) and on mutual information (Mutual Information). In these methods, the contribution of low-frequency words to classification becomes the key factor affecting the accuracy of feature selection.
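The sensitivity of these two criteria to low-frequency words can be illustrated with a small sketch. It assumes a common textbook estimate of mutual information, log(A·N / ((A+C)·(A+B))), which is not taken from this document; the function names and the counts are hypothetical.

```python
import math

def document_frequency(a, b):
    """Number of mails (spam + normal) that contain the term."""
    return a + b

def mutual_information(a, b, c, n):
    """Estimate MI between a term and the spam class.

    a: spam mails containing the term
    b: normal mails containing the term
    c: spam mails NOT containing the term
    n: total number of mails
    ASSUMPTION: textbook estimate log(A*N / ((A+C)*(A+B))); requires a > 0.
    """
    return math.log(a * n / ((a + c) * (a + b)))

# Hypothetical corpus: 100 spam mails and 100 normal mails.
rare = dict(a=1, b=0, c=99, n=200)      # term seen in a single spam mail
common = dict(a=80, b=20, c=20, n=200)  # term seen in 80 spam, 20 normal

mi_rare = mutual_information(**rare)
mi_common = mutual_information(**common)
df_rare = document_frequency(rare["a"], rare["b"])
df_common = document_frequency(common["a"], common["b"])
```

Under these assumed counts the term that occurs only once receives the higher mutual information score, while document frequency ranks it far below the common term, so the two criteria treat low-frequency words in opposite ways.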
Summary of the invention
The present invention proposes a new feature selection parameter, the likelihood ratio logarithm, to meet the specific requirements of spam filtering.
For an entry w and a class variable c, let x be the feature vector: $x_i = 0$ means that the i-th entry of the vector space x does not occur in class c, and $x_i = 1$ means that the i-th entry occurs in class c. Let N be the total number of sample mails, NL the number of normal mails, and NS the number of spam mails. $NS_{x_i}$ is the number of spam mails in which $X_i = x_i$, and $NL_{x_i}$ is the number of normal mails in which $X_i = x_i$.
When $x_i = 1$, $NS_{x_i}$ and $NL_{x_i}$ are the numbers of spam mails and normal mails, respectively, that contain the i-th entry, and the quantity (formula omitted in the source) is called the likelihood of $x_i = 1$, written likelihood1(w).
When $x_i = 0$, $NS_{x_i}$ and $NL_{x_i}$ are the numbers of spam mails and normal mails, respectively, that do not contain the i-th entry, and the quantity (formula omitted in the source) is called the likelihood of $x_i = 0$, written likelihood0(w).
The likelihood ratio logarithm LL(w) of entry w in the feature database is (formula omitted in the source).
The likelihood ratio logarithm takes into account the influence of an entry on classification both when it occurs and when it does not occur. The larger this value, the more the ratio of spam to normal mail differs between the two cases, and hence the greater the influence of the entry on classification.
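Because the formulas themselves are not reproduced in this text, the following is only a sketch under stated assumptions. It assumes that each likelihood is the ratio of the class-conditional frequencies of the entry, (NS_xi / NS) / (NL_xi / NL), and that LL(w) is the absolute value of the logarithm of likelihood1(w) / likelihood0(w). The add-alpha smoothing and all function names are hypothetical additions to avoid division by zero.

```python
import math

def likelihoods(ns1, nl1, ns, nl, alpha=1.0):
    """Return (likelihood1, likelihood0) for one entry.

    ns1, nl1: spam / normal mails that contain the entry
    ns, nl:   total spam / normal mails
    ASSUMPTION: likelihood = ratio of class-conditional frequencies,
    with add-alpha smoothing (neither is confirmed by the source).
    """
    p_spam = (ns1 + alpha) / (ns + 2 * alpha)    # P(x_i = 1 | spam)
    p_legit = (nl1 + alpha) / (nl + 2 * alpha)   # P(x_i = 1 | normal)
    likelihood1 = p_spam / p_legit
    likelihood0 = (1 - p_spam) / (1 - p_legit)
    return likelihood1, likelihood0

def likelihood_ratio_log(ns1, nl1, ns, nl, alpha=1.0):
    """ASSUMED form of LL(w): |log(likelihood1 / likelihood0)|."""
    l1, l0 = likelihoods(ns1, nl1, ns, nl, alpha)
    return abs(math.log(l1 / l0))

# Hypothetical counts out of 100 spam and 100 normal mails.
ll_neutral = likelihood_ratio_log(50, 50, 100, 100)  # equally common in both
ll_weak = likelihood_ratio_log(60, 40, 100, 100)     # mild spam indicator
ll_strong = likelihood_ratio_log(90, 5, 100, 100)    # strong spam indicator
```

With this assumed definition, an entry that is equally frequent in both classes scores zero, and the score grows as the spam-to-normal ratio diverges between the "occurs" and "does not occur" cases, which matches the behaviour described above.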
Specific implementation of the algorithm
1. Training algorithm
Initialize vocabulary V
Vectorize each classified mail in the training set M
SpamNum = total number of spam mails in the training set
LegitNum = total number of normal mails in the training set
FOR EACH word ∈ V
DO
    n_winspam = 0     // number of spam mails in which word appears
    n_winlegit = 0    // number of normal mails in which word appears
    FOR EACH mail ∈ M
    DO
        IF word ∈ mail THEN
            IF mail INSTANCEOF Spam THEN
                n_winspam = n_winspam + 1
            ELSE
                n_winlegit = n_winlegit + 1
    DONE
    Save n_winspam & n_winlegit & Likelihood0 & Likelihood1 & LL
DONE
Sort V by LL and select the n words with the largest LL
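The training procedure can be sketched in Python as follows. This is an illustration rather than the patented implementation: mails are assumed to be already segmented into sets of words, the likelihoods use an assumed smoothed ratio of class-conditional frequencies, LL is assumed to be the absolute log of likelihood1 / likelihood0, and the function and parameter names are hypothetical.

```python
import math

def train(mails, vocabulary, top_n, alpha=1.0):
    """Count word occurrences per class and keep the top_n words by LL.

    mails:      list of (set_of_words, is_spam) pairs
    vocabulary: iterable of candidate words (V)
    Returns {word: stats} for the selected words.
    ASSUMPTION: likelihood and LL formulas are not given in the source;
    a smoothed ratio of class-conditional frequencies is used here.
    """
    spam_num = sum(1 for _, is_spam in mails if is_spam)
    legit_num = len(mails) - spam_num
    table = {}
    for word in vocabulary:
        # Number of spam / normal mails in which the word appears.
        n_winspam = sum(1 for words, s in mails if s and word in words)
        n_winlegit = sum(1 for words, s in mails if not s and word in words)
        p_spam = (n_winspam + alpha) / (spam_num + 2 * alpha)
        p_legit = (n_winlegit + alpha) / (legit_num + 2 * alpha)
        l1 = p_spam / p_legit
        l0 = (1 - p_spam) / (1 - p_legit)
        table[word] = {
            "n_winspam": n_winspam,
            "n_winlegit": n_winlegit,
            "Likelihood1": l1,
            "Likelihood0": l0,
            "LL": abs(math.log(l1 / l0)),
        }
    # Sort V by LL (descending) and keep the n words with the largest LL.
    ranked = sorted(table, key=lambda w: table[w]["LL"], reverse=True)
    return {w: table[w] for w in ranked[:top_n]}

# Tiny hypothetical training set.
training_set = [
    ({"free", "offer"}, True),
    ({"free", "win"}, True),
    ({"meeting", "free"}, False),
    ({"meeting", "report"}, False),
]
model = train(training_set, ["free", "offer", "meeting"], top_n=1)
```

In this toy set "meeting" appears in both normal mails and in no spam mail, so it is the single word retained when `top_n=1`.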
2. Classification algorithm
DEFINE λ_THRE
FOR mail TO CLASSIFY DO
    λ = 1
    FOR EACH word ∈ V
    DO
        IF word ∈ mail
            λ = λ * word[Likelihood1]
        ELSE
            λ = λ * word[Likelihood0]
    DONE
    IF λ ≥ λ_THRE THEN
        CLASSIFY mail AS SPAM
    ELSE
        CLASSIFY mail AS LEGITIMATE MAIL
DONE
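A minimal Python sketch of the classification step is given below. It assumes the feature table is a dictionary mapping each selected word to its stored Likelihood1 and Likelihood0 values, and it sums logarithms instead of multiplying, which gives the same comparison against the threshold while avoiding floating-point underflow for large vocabularies. The function name, the table layout, and the default threshold are hypothetical.

```python
import math

def classify(mail_words, model, threshold=1.0):
    """Classify one mail given the per-word likelihood table.

    mail_words: set of words found in the mail
    model:      {word: {"Likelihood1": float, "Likelihood0": float}}
    threshold:  decision threshold on the product (lambda_THRE); must be > 0
    """
    log_lambda = 0.0
    for word, stats in model.items():
        # Use Likelihood1 when the word occurs in the mail, else Likelihood0.
        key = "Likelihood1" if word in mail_words else "Likelihood0"
        log_lambda += math.log(stats[key])
    # log(lambda) >= log(threshold) is equivalent to lambda >= threshold.
    return "SPAM" if log_lambda >= math.log(threshold) else "LEGITIMATE"

# Hypothetical feature table with two selected words.
model = {
    "free": {"Likelihood1": 4.0, "Likelihood0": 0.5},
    "meeting": {"Likelihood1": 0.25, "Likelihood0": 2.0},
}
label_a = classify({"free", "offer"}, model)
label_b = classify({"meeting", "report"}, model)
```

Raising the threshold above 1 makes the filter more conservative about labelling mail as spam, which is one way to trade missed spam against falsely blocked normal mail.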
3. Learning algorithm
FOR mail TO STUDY DO
    CASE mail OF
    SPAM:
        SpamNum = SpamNum + 1
    LEGITIMATE MAIL:
        LegitNum = LegitNum + 1
    FOR EACH word ∈ mail DO
        CASE mail OF
        SPAM:
            n_winspam = n_winspam + 1
        LEGITIMATE MAIL:
            n_winlegit = n_winlegit + 1
    DONE
DONE
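The incremental learning step can be sketched as a count update. The layout of the `counts` dictionary and the function name are hypothetical; the sketch assumes each mail increments the class total once and each distinct word in it once, and that the likelihoods and LL values are recomputed from the updated counts afterwards.

```python
def learn(counts, mail_words, is_spam):
    """Update the stored counts in place with one newly labelled mail.

    counts: {"SpamNum": int, "LegitNum": int,
             "words": {word: [n_winspam, n_winlegit]}}
    mail_words: iterable of words in the mail (duplicates are ignored)
    """
    # The class total is incremented once per mail, not once per word.
    if is_spam:
        counts["SpamNum"] += 1
    else:
        counts["LegitNum"] += 1
    idx = 0 if is_spam else 1
    for word in set(mail_words):
        # Words not seen before start from zero counts in both classes.
        counts["words"].setdefault(word, [0, 0])[idx] += 1
    return counts

# Hypothetical usage: learn from one spam mail and one normal mail.
counts = {"SpamNum": 0, "LegitNum": 0, "words": {}}
learn(counts, ["free", "free", "offer"], is_spam=True)
learn(counts, ["free", "meeting"], is_spam=False)
```

Keeping only counts in the feature database means a new sample costs one pass over its words, which is what allows the filter to follow changes in spam content without retraining from scratch.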
Description of drawings
Fig. 1 is a class diagram of the Bayesian algorithm implementation;
Fig. 2 is a class diagram of the filter implementation based on the Bayesian algorithm.