CN1790405A - Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email - Google Patents

Publication number
CN1790405A
Authority
CN
China
Prior art keywords
classification
mail
bayes
spam
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2005101356033A
Other languages
Chinese (zh)
Inventor
钱德沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CNA2005101356033A
Publication of CN1790405A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a new feature selection parameter, the likelihood ratio logarithm, designed specifically for spam classification. The parameter combines the influence that a low-frequency word has on classification both when it occurs in a mail and when it does not, and it lets the Bayesian classification process reuse the results computed during feature selection, improving the recall and the overall classification performance.

Description

Chinese spam content classification and authentication algorithm based on Bayes
Technical field
The present invention designs an algorithm for filtering Chinese spam by content on the basis of Bayesian classification. More specifically, it applies the Bayes principle and proposes a new feature selection parameter, the likelihood ratio logarithm, which it uses to filter Chinese spam.
Background art
Spam filtering techniques mainly comprise mail header filtering and mail content filtering. Header filtering considers only the header fields of a mail, so it can block a message before the mail has been fully submitted; content filtering mainly analyzes the text content of the mail.
In terms of content, spam filtering can be regarded as a binary classification problem: mail is divided into two classes, spam and normal mail. Various text classification methods can therefore be used to filter spam, such as the rule-based Ripper algorithm, the C4.5 decision tree algorithm, the Boosting method, and the Rough Set method, as well as statistics-based methods such as support vector machines, the kNN algorithm, and Bayesian classification. Numerous studies indicate that the Bayesian method achieves better accuracy than the other classification methods. It is also capable of adaptive learning: it can update the feature database it uses as new characteristics of spam emerge.
The key to the Bayesian algorithm is feature selection. The accuracy of feature selection determines the accuracy of the vector that represents a mail, and hence the accuracy of mail classification.
Existing Bayesian implementations all use feature selection based on document frequency (Document Frequency) or on mutual information (Mutual Information). In these feature selection methods, the contribution of low-frequency words to classification becomes the key factor affecting the accuracy of feature selection.
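The two existing feature selection criteria named above can be sketched in Python as follows. This is only an illustration under stated assumptions: each mail is assumed to be already segmented into a set of words, mutual information is taken in its common pointwise form log(P(t, c) / (P(t) P(c))), and the function names and toy data are hypothetical and do not come from the source.

```python
import math

def document_frequency(term, mails):
    """Number of mails that contain the term."""
    return sum(1 for tokens, _ in mails if term in tokens)

def mutual_information(term, label, mails):
    """Pointwise mutual information between a term and a class label."""
    n = len(mails)
    n_t = sum(1 for tokens, _ in mails if term in tokens)
    n_c = sum(1 for _, lab in mails if lab == label)
    n_tc = sum(1 for tokens, lab in mails if term in tokens and lab == label)
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log(n_tc * n / (n_t * n_c))

# Toy corpus (hypothetical): (set of words, label)
sample_mails = [
    ({"free", "offer"}, "spam"),
    ({"free", "meeting"}, "spam"),
    ({"meeting", "report"}, "legit"),
    ({"report"}, "legit"),
]
print(document_frequency("free", sample_mails))
print(mutual_information("free", "spam", sample_mails))
print(mutual_information("offer", "spam", sample_mails))
```

In this toy corpus "offer" occurs in a single mail yet receives the same mutual information score as "free", which occurs in two, showing how strongly such a score can depend on low-frequency words.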
Summary of the invention
The present invention proposes a new feature selection parameter, the likelihood ratio logarithm, to meet the specific requirements of spam filtering.
For an entry w and a class variable c, let x be the mail vector: $x_i = 0$ means that the i-th entry of the vector space does not occur in class c, and $x_i = 1$ means that the i-th entry occurs in class c. Let N be the total number of sample mails, NL the number of normal mails, and NS the number of spam mails. $NS_{x_i}$ is the number of spam mails with $X_i = x_i$, and $NL_{x_i}$ is the number of normal mails with $X_i = x_i$. Thus, when $x_i = 1$, $NS_{x_i}$ and $NL_{x_i}$ are the numbers of spam mails and normal mails that contain the i-th entry, and $\frac{NS_{x_i}/NS}{NL_{x_i}/NL}$ is called the likelihood of $x_i = 1$, denoted likelihood1(w). When $x_i = 0$, $NS_{x_i}$ and $NL_{x_i}$ are the numbers of spam mails and normal mails that do not contain the i-th entry, and $\frac{NS_{x_i}/NS}{NL_{x_i}/NL}$ is then called the likelihood of $x_i = 0$, denoted likelihood0(w). The likelihood ratio logarithm LL(w) of entry w in the feature database is

$$LL(w) = \left| \log \frac{likelihood0(w)}{likelihood1(w)} \right|.$$
The likelihood ratio logarithm accounts for the influence of an entry on classification in both cases: when the entry occurs and when it does not. The larger this value, the more the ratio of spam to normal mail differs between the two cases, and therefore the greater the influence of the entry on classification.
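The definition of LL(w) can be illustrated with a short Python sketch. The function names and the additive smoothing constant `alpha` are assumptions that do not appear in the source; smoothing is included only so that a count of zero, or a count equal to the class size, does not cause a division by zero or a logarithm of zero.

```python
import math

def likelihoods(n_w_in_spam, n_w_in_legit, spam_num, legit_num, alpha=0.5):
    """Return (likelihood0, likelihood1) for one entry.

    alpha is an additive-smoothing constant (an assumption, not in the source).
    """
    p_spam = (n_w_in_spam + alpha) / (spam_num + 2 * alpha)     # entry present, given spam
    p_legit = (n_w_in_legit + alpha) / (legit_num + 2 * alpha)  # entry present, given normal mail
    likelihood1 = p_spam / p_legit              # entry occurs
    likelihood0 = (1 - p_spam) / (1 - p_legit)  # entry does not occur
    return likelihood0, likelihood1

def likelihood_ratio_log(n_w_in_spam, n_w_in_legit, spam_num, legit_num, alpha=0.5):
    """LL(w) = |log(likelihood0(w) / likelihood1(w))|."""
    l0, l1 = likelihoods(n_w_in_spam, n_w_in_legit, spam_num, legit_num, alpha)
    return abs(math.log(l0 / l1))

# An entry seen equally often in both classes carries no information,
# while an entry concentrated in one class scores high.
print(likelihood_ratio_log(10, 10, 100, 100))
print(likelihood_ratio_log(90, 5, 100, 100))
```

Because of the absolute value, an entry concentrated in normal mail scores as high as one concentrated in spam; the score measures how much the entry separates the classes, not which class it favors.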
Detailed implementation
1. Training algorithm
Initialize the vocabulary V
Vectorize each pre-classified mail in the training set M
SpamNum = total number of spam mails in the training set
LegiNum = total number of normal mails in the training set
FOR EACH word ∈ V
DO
    n_winspam = 0    // number of spam mails in which word occurs
    n_winlegit = 0   // number of normal mails in which word occurs
    FOR EACH mail ∈ M
    DO
        IF word ∈ mail THEN
            IF mail INSTANCEOF Spam THEN
                n_winspam = n_winspam + 1
            ELSE
                n_winlegit = n_winlegit + 1
    DONE
    Likelihood0 = (1 - n_winspam / SpamNum) / (1 - n_winlegit / LegiNum)
    Likelihood1 = (n_winspam / SpamNum) / (n_winlegit / LegiNum)
    LL = | log(Likelihood0 / Likelihood1) |
    SAVE n_winspam, n_winlegit, Likelihood0, Likelihood1, LL
DONE
Sort V by LL and select the n words with the largest LL
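The training step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: each mail is assumed to be already segmented into a set of words paired with a spam flag, and the function name `train`, the dictionary layout, and the smoothing constant `alpha` (used to avoid division by zero when a count is 0) are assumptions that are not part of the source.

```python
import math

def train(mails, n_features, alpha=0.5):
    """mails: list of (set_of_words, is_spam). Returns {word: stats} for the
    n_features words with the largest LL."""
    spam_num = sum(1 for _, is_spam in mails if is_spam)
    legi_num = len(mails) - spam_num
    vocabulary = set().union(*(tokens for tokens, _ in mails))

    table = {}
    for word in vocabulary:
        # Count the mails of each class in which the word occurs.
        n_winspam = sum(1 for tokens, is_spam in mails if is_spam and word in tokens)
        n_winlegit = sum(1 for tokens, is_spam in mails if not is_spam and word in tokens)
        # Smoothed relative frequencies (smoothing is an assumption).
        p_spam = (n_winspam + alpha) / (spam_num + 2 * alpha)
        p_legit = (n_winlegit + alpha) / (legi_num + 2 * alpha)
        likelihood0 = (1 - p_spam) / (1 - p_legit)
        likelihood1 = p_spam / p_legit
        table[word] = {
            "n_winspam": n_winspam,
            "n_winlegit": n_winlegit,
            "Likelihood0": likelihood0,
            "Likelihood1": likelihood1,
            "LL": abs(math.log(likelihood0 / likelihood1)),
        }

    # Keep the n words with the largest LL.
    selected = sorted(table, key=lambda w: table[w]["LL"], reverse=True)[:n_features]
    return {w: table[w] for w in selected}

training_set = [
    ({"free", "prize", "hello"}, True),
    ({"free", "hello"}, True),
    ({"meeting", "hello"}, False),
    ({"report", "hello"}, False),
]
model = train(training_set, n_features=2)
print(sorted(model))
```

The saved Likelihood0 and Likelihood1 values are exactly what the classification step multiplies together, which is how the feature selection results are reused.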
2. Classification algorithm
DEFINE λ_THRE
FOR mail TO CLASSIFY DO
    λ = 1
    FOR EACH word ∈ V
    DO
        IF word ∈ mail
            λ = λ * word[Likelihood1]
        ELSE
            λ = λ * word[Likelihood0]
    DONE
    IF λ ≥ λ_THRE THEN
        CLASSIFY mail AS SPAM
    ELSE
        CLASSIFY mail AS LEGITIMATE MAIL
DONE
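A minimal Python sketch of the classification step follows. It assumes a model that maps each selected word to its saved Likelihood0 and Likelihood1 values, and it assumes that a mail is treated as spam when λ reaches the threshold. The function name, the default threshold of 1.0, and the example model values are hypothetical; the source defines λ_THRE without giving a value.

```python
def classify(mail_words, model, threshold=1.0):
    """Multiply the saved likelihoods over the selected vocabulary.

    mail_words: set of words in the mail to classify.
    model: {word: {"Likelihood0": float, "Likelihood1": float}}.
    threshold: stands in for λ_THRE (the default value is an assumption).
    """
    lam = 1.0
    for word, stats in model.items():
        if word in mail_words:
            lam *= stats["Likelihood1"]  # word occurs in the mail
        else:
            lam *= stats["Likelihood0"]  # word does not occur in the mail
    label = "SPAM" if lam >= threshold else "LEGITIMATE MAIL"
    return label, lam

# Hypothetical model with two selected words.
example_model = {
    "free": {"Likelihood0": 0.25, "Likelihood1": 4.0},
    "meeting": {"Likelihood0": 2.0, "Likelihood1": 0.5},
}
print(classify({"free", "now"}, example_model))
print(classify({"meeting"}, example_model))
```

For a large vocabulary the product can underflow or overflow; summing logarithms and comparing against the logarithm of the threshold is a common alternative, although the source describes the plain product.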
3. Learning algorithm
FOR mail TO STUDY DO
    λ = 1
    FOR EACH word ∈ mail DO
        CASE mail OF
            SPAM:
                n_winspam = n_winspam + 1
                SpamNum = SpamNum + 1
            LEGITIMATE MAIL:
                n_winlegit = n_winlegit + 1
                LegiNum = LegiNum + 1
    DONE
    Likelihood0 = (1 - n_winspam / SpamNum) / (1 - n_winlegit / LegiNum)
    Likelihood1 = (n_winspam / SpamNum) / (n_winlegit / LegiNum)
    LL = | log(Likelihood0 / Likelihood1) |
DONE
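One possible reading of the learning step is sketched below in Python. It rests on several assumptions that the source does not state: the class totals grow once per learned mail (so that the per-word counts remain document frequencies), the likelihoods of every stored word are recomputed after each mail, and additive smoothing with `alpha` prevents division by zero. The `state` layout and the function name are hypothetical.

```python
import math

def learn(state, mail_words, is_spam, alpha=0.5):
    """Update counts with one newly labelled mail and recompute the scores.

    state: {"SpamNum": int, "LegiNum": int, "counts": {word: [n_winspam, n_winlegit]}}
    mail_words: set of words in the mail, so each word counts once per mail.
    """
    # Assumption: class totals are incremented once per mail.
    if is_spam:
        state["SpamNum"] += 1
    else:
        state["LegiNum"] += 1
    for word in mail_words:
        counts = state["counts"].setdefault(word, [0, 0])
        counts[0 if is_spam else 1] += 1

    # Recompute Likelihood0, Likelihood1 and LL for every stored word.
    scores = {}
    for word, (n_winspam, n_winlegit) in state["counts"].items():
        p_spam = (n_winspam + alpha) / (state["SpamNum"] + 2 * alpha)
        p_legit = (n_winlegit + alpha) / (state["LegiNum"] + 2 * alpha)
        likelihood0 = (1 - p_spam) / (1 - p_legit)
        likelihood1 = p_spam / p_legit
        scores[word] = {
            "Likelihood0": likelihood0,
            "Likelihood1": likelihood1,
            "LL": abs(math.log(likelihood0 / likelihood1)),
        }
    return scores

filter_state = {"SpamNum": 0, "LegiNum": 0, "counts": {}}
learn(filter_state, {"free", "prize"}, True)
learn(filter_state, {"meeting"}, False)
latest = learn(filter_state, {"free"}, True)
print(filter_state["counts"]["free"], round(latest["free"]["LL"], 3))
```

Updating the counts in place is what lets the filter adapt its feature database as new kinds of spam appear.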
Description of drawings
Fig. 1 is the class diagram of the Bayesian algorithm implementation;
Fig. 2 is the class diagram of the filter implementation based on the Bayesian algorithm.

Claims (1)

1. A Bayes-based content classification and authentication algorithm for Chinese spam, which mainly applies the principle of the Bayes algorithm to Chinese spam filtering. On the basis of Bayes, it also proposes a new feature selection parameter, the likelihood ratio logarithm, which is designed specifically for spam classification, combines the influence on classification of low-frequency words in the two cases of occurring and not occurring in a mail, and enables the Bayesian process to use the computation results of the feature selection process.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2005101356033A CN1790405A (en) 2005-12-31 2005-12-31 Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2005101356033A CN1790405A (en) 2005-12-31 2005-12-31 Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email

Publications (1)

Publication Number Publication Date
CN1790405A true CN1790405A (en) 2006-06-21

Family

ID=36788233

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2005101356033A Pending CN1790405A (en) 2005-12-31 2005-12-31 Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email

Country Status (1)

Country Link
CN (1) CN1790405A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101150529B (en) * 2006-09-21 2011-07-27 腾讯科技(深圳)有限公司 A method and system for mail search
CN101374122B (en) * 2007-08-24 2011-05-04 赛门铁克公司 Filtering beayes assurance check in the content of non-training language to reduce false positive
CN100592692C (en) * 2007-09-27 2010-02-24 南京大学 Conditional mutual information based network intrusion classification method of double-layer semi-idleness Bayesian
CN101257378B (en) * 2008-04-09 2010-06-02 南京航空航天大学 Anti-disclosure mail safe card and method for detecting disclosure mail
CN104834640A (en) * 2014-02-10 2015-08-12 腾讯科技(深圳)有限公司 Webpage identification method and apparatus
US10452725B2 (en) 2014-02-10 2019-10-22 Tencent Technology (Shenzhen) Company Limited Web page recognizing method and apparatus
CN105447505A (en) * 2015-11-09 2016-03-30 成都数之联科技有限公司 Multilevel important email detection method
CN105447505B (en) * 2015-11-09 2018-12-18 成都数之联科技有限公司 A kind of multi-level important email detection method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20090306

Address after: P.O. Box 7-58, Beihang University, 37 Xueyuan Road, Haidian District, Beijing, postcode: 100191

Applicant after: Beihang University

Address before: Beijing, Xueyuan Road, Haidian District No. 35, Nanjing building, 16 floor, Germany and China, postcode: 100083

Applicant before: Qian Depei

ASS Succession or assignment of patent right

Owner name: BEIJING UNIV. OF AERONAUTICS + ASTRONAUTICS

Free format text: FORMER OWNER: QIAN DEPEI

Effective date: 20090306

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication