CN102255922A - Intelligent multilevel junk email filtering method - Google Patents


Info

Publication number
CN102255922A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110247504XA
Other languages
Chinese (zh)
Inventor
刘培玉
朱振方
杨玉珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN201110247504XA priority Critical patent/CN102255922A/en
Publication of CN102255922A publication Critical patent/CN102255922A/en
Pending legal-status Critical Current

Abstract

The invention discloses an intelligent multi-level spam filtering method. The method improves the conventional information gain algorithm by using the distribution information of feature items, which reduces the dependence on data during system training and improves the system's ability to analyze email content; it lowers the misjudgment rate for normal email and addresses the loss of semantic information in email content. For the problem of misjudging normal email during spam filtering, a classification method based on a weighted support vector machine is proposed: class weights for the two email classes and a weight reflecting the importance of each email are added, and a support vector machine classifier is trained to obtain the spam filter. The method builds an intelligent multi-level spam filtering platform by integrating several filtering techniques, including Internet protocol (IP) address and domain name system (DNS) blacklists, subject and attachment keyword filtering, email body content filtering and attachment text content filtering.

Description

A multi-level intelligent spam filtering method
Technical field
The invention belongs to the field of information technology and relates to the classification and filtering of spam, in particular to a multi-level intelligent spam filtering method.
Background art
With the development and spread of the Internet, email has brought great convenience to people's work and life; at the same time, unsolicited spam has become a serious nuisance. The flood of spam not only occupies a large amount of bandwidth and wastes network resources; spam is also becoming a vehicle for hacker attacks and a channel for spreading viruses, and it therefore brings serious security risks.
There is still no unified, clear international definition of spam, although it is usually defined as Unsolicited Bulk Email (UBE) or Unsolicited Commercial Email (UCE). The reason is that the same message may be judged differently by different users, and this is also why most spam filtering tools on the market are ineffective.
The effort to combat the spread of spam can be roughly divided into the following three stages:
(1) The first stage mainly judges spam by IP filtering, black and white lists, keyword matching and the like.
(2) The second stage mainly judges spam by intelligent content filtering based on statistical algorithms such as Bayes, together with mechanisms such as RBL filtering.
(3) The third stage grew out of statistics on the sending behavior of spam. This approach first collects statistics on, analyzes and computes over a large number of spam samples, and then builds a recognition model of spam-sending behavior according to the RFC822 protocol. Whether a message is spam can thus be judged during the mail transfer agent (MTA) communication stage. This approach effectively increases filtering speed and reduces network delay; however, filtering based on network behavior features is powerless when handling a single spam message.
In general, current spam filtering systems have the following problems:
(1) Misjudgment of normal mail
For the user, normal mail is generally very important; most users would rather read through all their mail than have one normal message filtered out. For a spam filtering system, the emphasis should therefore be on precision rather than recall. Most current mail filtering systems, however, give too much weight to recall, and most of them filter at the IP or dynamic-IP level, which causes users' normal mail to be misjudged.
(2) Loss of semantic information
Current filtering systems mostly focus on IP address filtering and on the statistical properties of mail, but neglect mining the semantic information of the mail. Spam is usually disguised as normal mail, and its legitimacy can be judged only by parsing its content; in this case, relying only on IP address filtering and statistical properties rarely gives satisfactory results. It is therefore necessary to mine the semantic information of mail in order to improve the accuracy of the filtering system.
(3) Lack of a complete spam filtering solution
As problems (1) and (2) show, relying on a single technique rarely yields a satisfactory filtering result. It is therefore necessary to integrate various techniques and give full play to the strengths of each, so as to avoid the limitations of any single technique. Current mail filtering systems lack exactly this kind of overall filtering solution.
Summary of the invention
The purpose of the invention is to solve the above problems by providing a multi-level intelligent spam filtering method that reduces the misjudgment rate for normal mail and solves the loss of semantic information in mail content, thereby building, from an overall perspective, a complete spam filtering system.
To achieve this purpose, the invention adopts the following technical scheme:
A multi-level intelligent spam filtering method, whose filtering steps are as follows:
Step1: the mail server listens on its ports and determines from the port whether the protocol is SMTP or POP3;
Step2: if the protocol is SMTP, go to step3 and continue; if it is POP3, hand over to the POP3 protocol processing module;
Step3: enter the SMTP receiving module and extract the relevant information of the mail;
Step4: for the extracted mail information, first perform black and white list filtering; if the mail is on a blacklist, discard it; otherwise go to step5 and continue;
Step5: then filter by mail keywords;
Step6: next, judge the mail content and handle the mail according to the result; if it is spam, discard it; otherwise go to step7 and continue;
Step7: determine whether the destination mailbox is a local mailbox; if so, proceed to local mailbox delivery and mail management; otherwise forward the mail.
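The control flow of Step1 to Step7 can be sketched as follows. This is only an illustrative sketch: the port numbers are the conventional SMTP and POP3 ports (the text names no ports), the function names and return labels are hypothetical, and the outcome of a keyword hit in Step5 is assumed to be "discard", as the embodiment later describes.

```python
SMTP_PORT = 25    # conventional SMTP port; an assumption, not given in the text
POP3_PORT = 110   # conventional POP3 port; an assumption, not given in the text


def protocol_for_port(port):
    """Step1/Step2: choose the protocol handler from the listening port."""
    if port == SMTP_PORT:
        return "smtp"
    if port == POP3_PORT:
        return "pop3"
    raise ValueError(f"unsupported port: {port}")


def handle_smtp_mail(on_blacklist, keyword_hit, content_is_spam, recipient_is_local):
    """Step4 to Step7 for one message whose checks were already evaluated.

    Each argument is a boolean produced by the corresponding filter stage.
    """
    if on_blacklist:          # Step4: black and white list filtering
        return "discard"
    if keyword_hit:           # Step5: keyword filtering (assumed to discard)
        return "discard"
    if content_is_spam:       # Step6: content judgment
        return "discard"
    # Step7: local delivery or forwarding
    return "deliver_locally" if recipient_is_local else "forward"
```

For example, `handle_smtp_mail(False, False, False, True)` reaches Step7 and returns the local delivery label, while any earlier hit stops the chain.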
In step4, black and white list filtering proceeds as follows. The IP address of the mail is filtered first: if the IP address is on the whitelist, the mail is judged legitimate and received; otherwise the IP address is checked against the blacklist, and if it is found there the mail is judged to be spam and discarded. Otherwise the DNS address is matched against the DNS whitelist, and on a match the mail is judged legitimate and received. Otherwise it is matched against the DNS blacklist, and on a match the mail is judged to be spam and discarded. Otherwise the mail subject keywords are matched.
In step6, the mail content is judged as follows:
Step 1: extract the message body and segment it into words;
Step 2: preprocess the segmentation result;
Step 3: perform feature selection on the preprocessed mail;
Step 4: classify the extracted features with a support vector machine (SVM);
Step 5: act on the classification result: legitimate mail is received; suspected spam is delivered and the user is asked for feedback; spam is discarded.
The preprocessing is as follows: first, semantic restoration is applied to the segmentation result, mainly by using rule-based methods to reorganize it and extract basic phrases and out-of-vocabulary words; then a stop-word list combined with part-of-speech tagging is used to remove high-frequency and low-frequency words.
Classification with the support vector machine (SVM) proceeds as follows:
(1) extract the mail text features;
(2) compute the class dependence measure of each feature;
(3) train the support vector machine with the word sequence kernel as the kernel function;
(4) use the class dependence measure to compute the decay factor of each word;
(5) classify the mail.
Beneficial effects of the invention:
1. The invention improves the information gain algorithm used in traditional feature selection
Training is mostly based on a balanced corpus, but in a real environment a balanced corpus rarely exists. Spam filtering is essentially a two-class problem, so the overall filtering result depends strongly on the balance of the corpus. To address this, the invention uses the distribution information of feature items to improve the traditional information gain algorithm, which reduces the dependence on data during system training and thus improves the system's ability to analyze mail content.
2. The invention constructs a text semantic representation model suited to spam filtering
The traditional vector space model assumes that feature items are mutually independent, so it ignores the semantic relations in the information, which gives the filtering process a mechanical defect. The invention therefore introduces natural language processing techniques into the vector space model and organizes the relations among feature items, so that the connections between the feature words of the filtered text are reflected and filtering accuracy improves.
3. The invention proposes a spam filtering method based on a weighted support vector machine
The spam filtering method based on a weighted support vector machine is proposed mainly for the problem of misjudging normal mail during spam filtering. It adds class weights for the two mail classes and a weight reflecting the importance of each message, and then trains a support vector machine classifier to obtain the spam filter.
4. The invention proposes a word sequence kernel based on the class dependence measure
Classification with support vector machines usually ignores text structure and therefore loses a large amount of semantic information. To address this, the invention proposes a word sequence kernel based on the class dependence measure. The implementation steps are as follows:
(1) Extract the mail text features.
(2) Compute the class dependence measure of each feature.
(3) Train the support vector machine with the word sequence kernel as the kernel function.
(4) Use the class dependence measure to compute the decay factor of each word.
(5) Classify the mail.
5. The invention introduces feedback and self-learning mechanisms into the spam filtering template
Because mail content changes dynamically, the training samples should also be updated continuously as the system runs. Different training samples contribute differently to the mail filtering system, so each sample in the sample space should be given a weight, and the sample weights should be adjusted dynamically during filtering according to the filtering results. This effectively retains the samples that contribute most to the system and reduces the interference caused by samples with a low contribution.
6. The invention finally builds a multi-level intelligent spam filtering platform
The invention combines several filtering techniques, including IP address and DNS blacklists, keyword filtering of subjects and attachments, message body content filtering and attachment text content filtering, to build a multi-level intelligent spam filtering platform.
Description of drawings
Fig. 1 is a flow chart of the filtering method of the invention;
Fig. 2 is a flow chart of content-based spam filtering;
Fig. 3 is a flow chart of the feedback process.
Embodiment
The invention is further described below with reference to the drawings and an embodiment.
The invention uses several spam filtering methods, applied in a fixed order, to form a multi-level, integrated filter. Fig. 1 shows the spam filtering workflow of the invention. After a message is received, the filtering module filters it in the following order:
(1) First check whether the IP address is on the whitelist. If it is, the mail is judged to be normal mail. If not, continue with the subsequent filters in order.
(2) Match the IP address against the blacklist. If it is found, the mail is spam. Otherwise, continue with the filtering flow.
(3) Match the DNS whitelist. On a match, the mail is judged legitimate and handed to the local delivery or forwarding module. Otherwise, continue with the filtering flow.
(4) Match the DNS blacklist. On a match, the mail is judged to be spam and discarded. Otherwise, continue with the filtering flow.
(5) Match the mail subject keywords. A match means the subject contains an illegal keyword; the mail is spam and is discarded. Otherwise, continue with the filtering flow.
(6) If there is an attachment, match the attachment name keywords. A match means the attachment name contains an illegal keyword; the mail is judged to be spam and discarded. Otherwise, continue with the filtering flow.
(7) Judge the mail body content. If the content is judged to be spam, the mail is still delivered; if the user does not handle it within a certain period, it is deleted as spam.
(8) If there is a text attachment, filter the attachment content. If the content judgment module judges the attachment content to be spam, the mail is handled in the same way as when the body is judged to be spam.
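Stages (1) to (6) above form an ordered cascade in which the first matching rule decides the outcome. A minimal sketch of that cascade is given below; the `Mail` and `Lists` types, their field names and the returned labels are hypothetical names chosen for illustration, and keyword matching is simplified to substring search.

```python
from dataclasses import dataclass, field


@dataclass
class Mail:
    ip: str
    domain: str
    subject: str
    attachment_names: list = field(default_factory=list)


@dataclass
class Lists:
    ip_white: set
    ip_black: set
    dns_white: set
    dns_black: set
    keywords: set


def prefilter(mail, lists):
    """Apply stages (1) to (6) in order; the first rule that matches decides."""
    if mail.ip in lists.ip_white:          # (1) IP whitelist
        return "accept"
    if mail.ip in lists.ip_black:          # (2) IP blacklist
        return "discard"
    if mail.domain in lists.dns_white:     # (3) DNS whitelist
        return "accept"
    if mail.domain in lists.dns_black:     # (4) DNS blacklist
        return "discard"
    if any(k in mail.subject for k in lists.keywords):       # (5) subject keywords
        return "discard"
    if any(k in name for name in mail.attachment_names
           for k in lists.keywords):                          # (6) attachment names
        return "discard"
    return "content_check"                 # hand over to stages (7) and (8)
```

Because whitelist checks come before the corresponding blacklist and keyword checks, a whitelisted sender is accepted even if the subject contains a listed keyword, which matches the order given in the text.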
The core of the invention is the content filtering stage shown in Fig. 2. In this stage the message body is first extracted and segmented into words. Because Chinese word segmentation systems are not perfect, part of the semantics of the mail content is lost after segmentation, so semantic restoration must be applied to the segmentation result.
(1) Preprocessing stage:
Preprocessing in the invention has two stages. The first is semantic restoration of the segmentation result; the second is stop-word removal.
(a) Semantic restoration stage
This stage mainly uses rule-based methods to reorganize the segmentation result and extract basic phrases and out-of-vocabulary words. The basic process is as follows:
Basic phrase identification takes a segmented and tagged text as input and outputs the text with its phrases identified. An input feature consists of two parts: a condition, and a rule, i.e. the action to perform once the condition is satisfied. We therefore formulate a basic-phrase recognition condition template and rule merging templates, and finally use maximum information entropy to identify the best basic phrase.
Chinese is a paratactic language in which word order strongly affects meaning; Chinese is mostly written from left to right, and the head word is mostly the last word. Phrase identification therefore proceeds from back to front, i.e. in reverse order, and the invention uses a stack as the data structure for storage.
Because the words in a sentence depend on their context, information such as the current word, the preceding and following words, the part of speech and the number of syllables of each word must be considered.
Therefore, according to the factors that influence phrase formation, the feature space is defined as:
1. Part-of-speech information: the parts of speech of the current word and of the two words before and after it.
2. Words: special-purpose words before and after the current word that affect whether the current word forms a phrase, for example function words such as " ", " ".
3. Label class: the class the current word should be labeled with; we define two classes, noun phrase and verb phrase.
4. Syllable count: the number of syllables of the current word and of each neighboring word. To avoid data sparsity, phrases are mostly merged from two words; when merging three-word phrases, the emphasis is on monosyllabic words.
5. Punctuation: specific punctuation marks that affect phrase formation, such as ", ".
The recognition conditions are defined from the above feature space. In formulating the basic-phrase recognition conditions we defined a condition template, as shown in Table 1.
Table 1. Feature condition template (rendered as an image in the source)
When the feature function takes a specific value, the condition template is instantiated into a concrete feature. Part-of-speech tagging follows the "Modern Chinese Corpus Processing: Word Segmentation and Part-of-Speech Tagging Specification" formulated by the computational linguistics group at Peking University. For special boundary-marking words such as " ", " ", " ", " with ", we draw up a boundary word list in advance for identifying phrase boundaries; to identify phrase boundaries better, we also draw up a boundary part-of-speech table that includes parts of speech such as conjunctions and punctuation.
The instantiated feature condition template is used as the test for whether the input satisfies a phrase merging rule (some merging rules are shown in Table 2). If it does, the phrase is merged; otherwise the next test is applied. The whole matching process thus becomes a binary classification process, and each feature can be expressed as a binary feature function. For example, the binary feature function of the first rule in Table 2 is:
Figure BDA0000086322290000092
Table 2. Partial phrase merging rules (rendered as an image in the source)
(b) Stop-word removal
To remove stop words, the invention uses a stop-word list combined with part-of-speech tagging, which removes high-frequency words and low-frequency words as well as the stop words themselves.
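The combination of a stop-word list, part-of-speech tags and frequency limits can be sketched as below. This is a minimal illustration under stated assumptions: the input is a list of `(word, tag)` pairs, and the tag names, the `min_count` and `max_count` thresholds and the function name are hypothetical, since the text does not give concrete values.

```python
from collections import Counter


def remove_stop_words(tagged, stop_words, drop_tags, min_count=1, max_count=None):
    """Filter a list of (word, pos_tag) pairs and return the kept words.

    A word is dropped if it is on the stop-word list, if its tag is in
    drop_tags, or if its frequency in this text is below min_count or
    above max_count (thresholds are illustrative assumptions).
    """
    counts = Counter(word for word, _ in tagged)
    kept = []
    for word, tag in tagged:
        if word in stop_words or tag in drop_tags:
            continue
        if counts[word] < min_count:
            continue
        if max_count is not None and counts[word] > max_count:
            continue
        kept.append(word)
    return kept
```

For instance, with `stop_words={"the"}` and `drop_tags={"w", "u"}`, the pairs `("the", "u")` and `(",", "w")` are removed while content words pass through unless a frequency limit excludes them.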
(2) Feature selection
The purpose of feature selection is to extract, from the result of the stop-word removal step, the feature items that represent the mail, while removing items that have no discriminating power and only interfere with mail analysis. The invention mainly uses the information gain algorithm for feature selection. In real experimental settings, however, corpora are often heavily imbalanced, and mail filtering is a two-class problem that depends fairly strongly on corpus balance. The invention therefore improves the information gain algorithm as follows:
(a) Between-class dispersion of a feature item
From statistics, the mean is simply the average value, while the variance measures how far individual values deviate from the mean; together they can characterize how a term is distributed across the corpus and across documents. Formula (2) can be used to estimate the variance:
$$S^2 = \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n-1} \tag{2}$$
where n is the total number of classes in the corpus, $\bar{d}$ is the class mean of feature item t, and $d_i$ is the total frequency of feature item t in class i.
Here the term frequency $tf_i$ of the feature item can replace $d_i$ in formula (2), and the average term frequency $\overline{tf(t)}$ can be used to improve the computation. The between-class dispersion of a feature item (Distribution Information Among a Class, DIac) can then be expressed as formula (3):
$$DIac = \frac{\left[\sum_{i}^{n} \left(tf_i(t) - \overline{tf(t)}\right)^2\right] / (n-1)}{tf(t)} \tag{3}$$
We use DIac = S to represent the distribution of term t across the classes. When t occurs in all the texts of only one class, DIac attains its maximum; at this point E(t) and
Figure BDA0000086322290000105
are both 0, IG(D, t) attains its maximum, and the discriminating power of t is strongest. When the TF of t is the same in every class, DIac attains its minimum and the discriminating power of t is weakest.
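The behavior described above can be checked numerically with a short sketch of the between-class dispersion. It assumes the reading of formula (3) in which the sample variance of the per-class term frequencies is divided by the total term frequency; that denominator, the handling of empty input and the function name are assumptions for illustration.

```python
def between_class_dispersion(class_freqs):
    """DIac sketch: sample variance of per-class term frequencies,
    normalised by the total term frequency (an assumed reading of formula (3)).
    """
    n = len(class_freqs)
    total = sum(class_freqs)
    if n < 2 or total == 0:
        return 0.0  # dispersion is undefined here; 0.0 is a convention of this sketch
    mean = total / n
    variance = sum((f - mean) ** 2 for f in class_freqs) / (n - 1)
    return variance / total
```

With two classes, a term concentrated in one class (`[6, 0]`) scores higher than one spread unevenly (`[4, 2]`), and a term with equal frequency in every class (`[3, 3]`) scores 0, in line with the extremes stated in the text.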
(b) Within-class dispersion of a feature item
Similarly, the variance can describe the distribution of feature item t within a class, called the within-class dispersion DIic (Distribution Information Inside a Class). It is computed as in formula (4):
$$DIic = \frac{\left[\sum_{j}^{m} \left(tf_j(t) - \overline{tf'(t)}\right)^2\right] / (m-1)}{tf'(t)} \tag{4}$$
where $tf_j(t)$ is the frequency of t in the j-th document, m is the total number of documents in the class, $\overline{tf'(t)}$ is the mean frequency of t per document, and $tf'(t)$ is the total frequency of t over the documents. By formula (4), when t occurs in all the documents of the class, DIic attains its minimum and the discriminating power of feature item t is strongest; the value of DIic is thus inversely related to classification ability.
(c) Improved information gain algorithm
From the above analysis, the value of the within-class dispersion DIic reveals whether a feature item is balanced within its class. The larger DIic is, the more unevenly the term is distributed within the class; when t occurs in only one document of a class, DIic attains its maximum of 1, and the distribution of t within the class is highly unbalanced. The between-class dispersion (DIac) in turn reveals the class distribution of the feature item: the more uneven the class distribution of the feature item, the smaller the between-class dispersion.
Suppose a feature item is highly unbalanced both across classes and across documents, i.e. feature item t occurs with high frequency in only a few documents of one class. The information gain of t for that class is then still fairly large, so the feature item would be selected, which is not the desired result. In this case DIic is usually large while DIac is small. We therefore use DIic > DIac as the test and conditionally multiply the third part of the information gain formula by the penalty term DIac, to balance the negative effect of the case in which the feature is absent. This reduces the probability of selecting features that occur rarely in one class but often in other classes. The information gain formula with the penalty factor (Gain Distribution Information, GDI) is:
$$GDI(D, t) = -\sum_{i=1}^{m} P(C_i) \log P(C_i) + P(t) \sum_{i=1}^{m} P(C_i \mid t) \log P(C_i \mid t) + P(\bar{t}) \sum_{i=1}^{m} P(C_i \mid \bar{t}) \log P(C_i \mid \bar{t}) \times DIac \tag{5}$$
Formula (5) penalizes the feature-absent case when the feature distribution is unbalanced, but it is not suitable when the feature distribution is fairly balanced. We therefore use DIic and DIac as the test for combining the traditional information gain algorithm (IG) with GDI, which forms a new feature selection algorithm, IG-GDI. It overcomes the defect of traditional information gain while keeping its advantages. The algorithm flow is as follows:
Input: V, all the features
Output: F, the selected feature subset
Delete low-frequency words using DF
For each feature item t:
    Compute DIic
    Compute DIac
    If DIic > DIac:
        Compute GDI by formula (5)
    Otherwise:
        Compute IG
    If the IG or GDI value is large:
        Add t to the set F
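The IG-GDI flow above can be sketched in a few lines. This is an illustrative sketch, not the patented implementation: probabilities are estimated from document counts, the natural logarithm is used, and `min_df`, `threshold`, the layout of `stats` and all function names are assumptions introduced here (the text only says the score should be "large").

```python
import math


def _plogp(p):
    return p * math.log(p) if p > 0 else 0.0


def ig_gdi_score(docs_with_t, docs_per_class, di_ic, di_ac):
    """Score one term: formula (5) when DIic > DIac, plain IG otherwise.

    docs_with_t[i]   : number of documents of class i that contain the term
    docs_per_class[i]: number of documents in class i
    """
    n = sum(docs_per_class)
    n_t = sum(docs_with_t)
    entropy = -sum(_plogp(c / n) for c in docs_per_class)
    present = 0.0
    if n_t > 0:
        present = (n_t / n) * sum(_plogp(a / n_t) for a in docs_with_t)
    absent = 0.0
    if n > n_t:
        absent = ((n - n_t) / n) * sum(
            _plogp((c - a) / (n - n_t)) for a, c in zip(docs_with_t, docs_per_class)
        )
    if di_ic > di_ac:
        absent *= di_ac  # penalty on the term-absent part, as in formula (5)
    return entropy + present + absent


def select_features(stats, docs_per_class, min_df=1, threshold=0.0):
    """stats maps term -> (docs_with_t, di_ic, di_ac); returns the subset F."""
    selected = []
    for term, (docs_with_t, di_ic, di_ac) in stats.items():
        if sum(docs_with_t) < min_df:  # DF filter for low-frequency words
            continue
        if ig_gdi_score(docs_with_t, docs_per_class, di_ic, di_ac) > threshold:
            selected.append(term)
    return selected
```

In a balanced two-class corpus, a term that appears in every document of one class and none of the other scores log 2, while a term spread evenly over both classes scores 0 under plain IG.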
(3) Classification with SVM
(a) Weighted support vector machine filtering model
The weighted support vector machine (Weighted Support Vector Machines, WSVM) is generally used to solve classification problems with imbalanced samples. The invention holds that, besides a large difference in the number of samples per class, a difference in the importance of the classes can also make the samples imbalanced. In spam filtering, normal mail is clearly more important than spam. While classification accuracy is maintained, misjudging normal mail should be avoided as far as possible, so the mail filtering problem is essentially also a classification problem with imbalanced samples.
Let the mail training sample set be:
$$(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_l, y_l), \quad \vec{x}_i \in R^n,\ y_i \in \{-1, +1\} \tag{6}$$
where $\vec{x}_i$ is the vector of the i-th message and $y_i$ is the class label: $y_i = 1$ means the i-th message is normal mail, and $y_i = -1$ means it is spam. The spam filtering model based on the weighted support vector machine is:
$$\min\ \frac{1}{2} \|\vec{w}\|^2 + C\sigma \sum_{i=1}^{l} s_i \xi_i \tag{7}$$
$$\text{s.t.}\quad y_i \left(\vec{w}^T \Phi(\vec{x}_i) + b\right) \ge 1 - \xi_i$$
where $\xi_i \ge 0$, $i = 1, 2, \ldots, l$, and $K(\vec{x}_i, \vec{x}_j)$ is the kernel function. SVM classifiers based on the radial basis kernel filter spam well, and this kernel is commonly used, so the radial basis kernel function is chosen here:
Figure BDA0000086322290000136
$s_i > 0$ is the importance weight of a sample: $0 < s_i < 1$ means sample $\vec{x}_i$ is unimportant; $s_i = 1$ means $\vec{x}_i$ is of ordinary importance; $s_i > 1$ means $\vec{x}_i$ is important. The class weight satisfies $\sigma \ge 1$, and samples of the same class share the same class weight. Compared with the standard support vector machine, the most notable advantage of the weighted support vector machine is that it fuzzifies the penalty for misclassified samples: the slack variable of each sample is multiplied by the importance weight and the class weight of that sample.
The Lagrange function for formula (7) is constructed as follows:
$$\Phi(\vec{w}, b, \alpha) = \frac{1}{2} \|\vec{w}\|^2 + C\sigma \sum_{i=1}^{l} s_i \xi_i - \sum_{i=1}^{l} \alpha_i \left( y_i \left(\vec{w}^T \Phi(\vec{x}_i) + b\right) - 1 + \xi_i \right) - \sum_{i=1}^{l} \beta_i \xi_i \tag{8}$$
where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers. Setting
$$\frac{\partial \Phi}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{l} \alpha_i y_i \Phi(\vec{x}_i) = 0 \tag{9}$$
$$\frac{\partial \Phi}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0 \tag{10}$$
$$\frac{\partial \Phi}{\partial \xi_i} = \sigma s_i C - \alpha_i - \beta_i = 0 \tag{11}$$
and substituting formulas (9) to (11) into the Lagrange function, the dual of the optimization problem of the weighted support vector machine is:
$$\max_{\alpha}\ -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j) + \sum_{i=1}^{l} \alpha_i$$
$$\text{s.t.}\quad \sum_{i=1}^{l} \alpha_i y_i = 0$$
$$0 \le \alpha_i \le \sigma C s_i, \quad i = 1, 2, \ldots, l \tag{12}$$
Solving the quadratic program (12) gives the optimal Lagrange coefficients $\alpha^* = (\alpha_1^*, \ldots, \alpha_l^*)$; substituting gives:
$$\vec{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i K(\vec{x}_i, \vec{x}_j) \tag{13}$$
$$b^* = y_i - \sum_{i=1}^{l} y_i \alpha_i^* y_i K(\vec{x}_i, \vec{x}_j) \tag{14}$$
Finally the optimal classifier is obtained:
$$f(\vec{x}_j) = \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i^* K(\vec{x}_i, \vec{x}_j) + b^* \right) \tag{15}$$
where
Figure BDA00000863222900001411
It follows that the upper bound of the Lagrange coefficients varies with the importance of the sample and with the class it belongs to.
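Two pieces of the weighted SVM lend themselves to a short sketch: the per-sample upper bound on the Lagrange coefficients from the constraint in (12), and the decision function (15). The sketch below assumes the multipliers have already been obtained from a quadratic programming solver, and it uses the common radial basis form exp(-gamma * ||x - z||^2); the exact kernel formula is an image in the source, so that form, the `gamma` parameter and the function names are assumptions.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis kernel in the common exp(-gamma * squared distance) form (assumed)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def alpha_upper_bound(C, class_weight, importance):
    """Box constraint of the dual (12): 0 <= alpha_i <= sigma * C * s_i."""
    return class_weight * C * importance


def decide(x, support, b, gamma=0.5):
    """Decision function in the shape of (15).

    support is a list of (x_i, y_i, alpha_i) triples, with alpha_i taken from
    a solver that is outside the scope of this sketch.
    Returns +1 (normal mail) or -1 (spam); a score of exactly 0 maps to +1 here.
    """
    score = sum(y * a * rbf_kernel(xi, x, gamma) for xi, y, a in support) + b
    return 1 if score >= 0 else -1
```

Raising the class weight or importance weight of a normal message raises its bound sigma * C * s_i, so the optimizer is allowed to pay more to keep that message on the correct side, which is how the model discourages misjudging normal mail.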
(b) Word sequence kernel based on the class dependence measure
Current research on spam filtering commonly ignores the structure of the mail text and therefore loses a large amount of semantic information, which makes it hard for filtering algorithms to mine the semantic information that best reflects the nature of the mail.
To address this problem, the invention proposes a word sequence kernel based on the class dependence measure and applies it to SVM spam filtering. First the mail text features are extracted and the class dependence measure of each feature is computed; then the support vector machine is trained with the word sequence kernel as the kernel function, using the class dependence measure to compute the decay factor of each word during training; finally the mail is classified.
1. Word sequence kernel
Let $\Sigma$ denote a finite set of words and let $s = s_1 s_2 \ldots s_{|s|}$ denote any word sequence, where $|s|$ is the length of s. Let $i = [i_1, \ldots, i_n]$ denote indices into the word sequence s, with $1 \le i_1 \le \ldots \le i_n \le |s|$, and let $s[i] = s_{i_1} \ldots s_{i_n}$ denote a subsequence of s; in this subsequence $s_{i_j}$ and $s_{i_{j+1}}$ need not be adjacent words in s. Clearly $s[i] \in \Sigma^n$. For a subsequence $\mu = \mu_1 \ldots \mu_{|\mu|}$ of length $|\mu|$, suppose $s[i]$ is found in s with $\mu = s[i]$, i.e. $\mu_1 = s_{i_1}, \ldots, \mu_{|\mu|} = s_{i_{|\mu|}}$. Let $l(i)$ denote the length spanned by the subsequence $\mu$ in s; then $l(i) = i_{|\mu|} - i_1 + 1$, where $i_{|\mu|}$ is the index corresponding to $\mu_{|\mu|}$. The word sequence kernel between two word sequences s and t is then:
$$K_{wn}(s, t) = \sum_{\mu \in \Sigma^n} \sum_{i:\, \mu = s[i]} \sum_{j:\, \mu = t[j]} \prod_{1 \le j \le |\mu|} \lambda_{m,\mu_j}^2 \prod_{i_1 \le k \le i_{|\mu|},\ k \notin i} \lambda_{g,s_k} \prod_{j_1 \le l \le j_{|\mu|},\ l \notin j} \lambda_{g,t_l} \tag{16}$$
Wherein, λ is the attenuation coefficient of the discontinuous subsequence of punishment, λ M, xAnd λ G, xTwo attenuation coefficients that expression is set same speech x, λ M, xRefer to the attenuation coefficient when speech x is match point, λ G, xRefer to the attenuation coefficient when speech x is gap point, do not have necessary relation between the two.For example, when calculating above-mentioned kernel function, establish λ m=1, with regard to expression the speech as match point is not punished so; Make λ g=0, then be illustrated in and do not allow to occur non-continuous series in the matching process.
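The definition in Eq. (16) can be checked on small inputs by direct enumeration. The following Python sketch is illustrative only: the function name `word_sequence_kernel` and the callables `lam_m` and `lam_g` are assumptions of this example, not names used by the invention. It enumerates every pair of index vectors, so its cost grows very quickly with sequence length; practical implementations of subsequence kernels usually rely on dynamic programming instead.

```python
from itertools import combinations


def word_sequence_kernel(s, t, n, lam_m, lam_g):
    """Brute-force evaluation of Eq. (16) for subsequences of length n >= 1.

    s, t  : lists of words
    lam_m : callable, word -> match-point decay coefficient
    lam_g : callable, word -> gap-point decay coefficient
    """
    total = 0.0
    for i in combinations(range(len(s)), n):      # index vector i into s
        for j in combinations(range(len(t)), n):  # index vector j into t
            if any(s[a] != t[b] for a, b in zip(i, j)):
                continue  # s[i] and t[j] are not the same subsequence mu
            weight = 1.0
            for a in i:                            # matched words, squared
                weight *= lam_m(s[a]) ** 2
            for k in range(i[0], i[-1] + 1):       # gap words inside the span in s
                if k not in i:
                    weight *= lam_g(s[k])
            for l in range(j[0], j[-1] + 1):       # gap words inside the span in t
                if l not in j:
                    weight *= lam_g(t[l])
            total += weight
    return total
```

With `lam_g` returning 0 for every word, only contiguous matches contribute, which mirrors the remark about $\lambda_g = 0$ in the text.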
2. Introducing the class dependence measure
The penalty applied to each word should depend on its ability to discriminate between classes. A word with strong discriminating power should have a large decay coefficient as a match point and a small one as a gap point, whereas a word with weak discriminating power should play a roughly balanced role whether it is a match point or a gap point. The class dependence measure (Dependence Measure, DP) is therefore introduced as the basis for computing the decay coefficients. The calculation is given below.
Let L denote the set of normal mail, S the set of spam, and $t_k$ a feature word. Then
$$DP(t_k) = \bigl(p(t_k \mid L) - p(t_k \mid S)\bigr)^2 \qquad (17)$$
is the class dependence measure of $t_k$.
Here $p(t_k \mid L)$ and $p(t_k \mid S)$ are the probabilities that $t_k$ occurs given class L or S respectively. $DP(t_k)$ attains its minimum of 0 if and only if $t_k$ is equally probable in the two classes, and attains its maximum of 1 only in the extreme case where $t_k$ occurs with probability 1 in one class and never in the other. The match-point decay coefficient $\lambda_{m,t_k}$ of $t_k$ is proportional to its class dependence measure. As a function of the variable $p(t_k \mid L) - p(t_k \mid S)$, $DP(t_k)$ is an upward-opening parabola: when $|p(t_k \mid L) - p(t_k \mid S)|$ lies in an $\varepsilon$-neighborhood of 0, $DP(t_k)$ is very small and changes slowly, which is unfavorable for computing kernel similarity. Therefore, when the class dependence measure is introduced into the kernel function, the following formula is used:
$$\lambda_{m,t_k} = \bigl|p(t_k \mid L) - p(t_k \mid S)\bigr| \qquad (18)$$
In computation, these probabilities can be estimated by replacing probability with frequency:
$$p(t_k \mid c) = \frac{\#(t_k \text{ in } c)}{\#(* \text{ in } c)}, \quad c \in \{L, S\} \qquad (19)$$
where $\#(t_k \text{ in } c)$ is the number of times $t_k$ occurs in $c$, $\#(* \text{ in } c)$ is the total number of word occurrences in $c$, and $c$ stands for L or S. The gap-point decay coefficient can accordingly be taken as:
$$\lambda_{g,t_k} = 1 - \bigl|p(t_k \mid L) - p(t_k \mid S)\bigr| \qquad (20)$$
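As a rough illustration of Eqs. (18) to (20), the sketch below estimates the two decay coefficients of every word from raw word counts. The function name `decay_coefficients` and the representation of each message as a list of words are assumptions made for this example.

```python
from collections import Counter


def decay_coefficients(normal_docs, spam_docs):
    """Return {word: (lambda_m, lambda_g)} following Eqs. (18)-(20).

    normal_docs, spam_docs : lists of documents, each a list of words
    """
    count_L = Counter(w for doc in normal_docs for w in doc)
    count_S = Counter(w for doc in spam_docs for w in doc)
    total_L = sum(count_L.values())   # #(* in L)
    total_S = sum(count_S.values())   # #(* in S)
    coeffs = {}
    for w in set(count_L) | set(count_S):
        # Eq. (19): frequency estimate of p(t_k | c); 0 if the class is empty
        p_L = count_L[w] / total_L if total_L else 0.0
        p_S = count_S[w] / total_S if total_S else 0.0
        lam_m = abs(p_L - p_S)            # Eq. (18)
        coeffs[w] = (lam_m, 1.0 - lam_m)  # Eq. (20)
    return coeffs
```

A word that is equally frequent in both classes gets a match coefficient of 0 and a gap coefficient of 1, so it contributes nothing as a match and costs nothing as a gap.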
In addition, to avoid the very small similarity values caused by sparsity, the linearity of kernel functions is used to make the following adjustment:
$$K_n(s,t) = \sum_{i=1}^{n} \mu^{1-i}\, \hat{K}_i(s,t) \qquad (21)$$
where powers of $\mu$ serve as weight coefficients (here $\mu$ is a scalar weight, not a subsequence) and $\hat{K}_i$ is the normalized word sequence kernel:
$$\hat{K}_i(s,t) = \frac{K_i(s,t)}{\sqrt{K_i(s,s)\,K_i(t,t)}} \qquad (22)$$
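The normalization and weighted sum of Eqs. (21) and (22) can be sketched independently of the particular base kernel. In the Python example below, `normalized_kernel` and `combined_kernel` are hypothetical names, and `ngram_match_kernel` is a deliberately simple stand-in (it counts equal contiguous i-grams) used only so that the example runs on its own; it is not the word sequence kernel of Eq. (16).

```python
import math


def normalized_kernel(kernel, s, t, i):
    """Eq. (22): K_i(s, t) / sqrt(K_i(s, s) * K_i(t, t))."""
    denom = math.sqrt(kernel(s, s, i) * kernel(t, t, i))
    return kernel(s, t, i) / denom if denom > 0 else 0.0


def combined_kernel(kernel, s, t, n, mu):
    """Eq. (21): sum over i = 1..n of mu**(1 - i) * normalized K_i(s, t)."""
    return sum(mu ** (1 - i) * normalized_kernel(kernel, s, t, i)
               for i in range(1, n + 1))


def ngram_match_kernel(s, t, i):
    """Toy stand-in kernel (an assumption): counts pairs of equal contiguous i-grams."""
    grams_s = [tuple(s[a:a + i]) for a in range(len(s) - i + 1)]
    grams_t = [tuple(t[b:b + i]) for b in range(len(t) - i + 1)]
    return float(sum(gs == gt for gs in grams_s for gt in grams_t))
```

Because each normalized term equals 1 when a sequence is compared with itself, short and long messages are put on a comparable scale before the terms are summed.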
Through the steps above, mail is classified. The invention divides mail into three categories: normal mail, spam, and suspected spam. Normal mail is delivered as usual and spam is filtered out directly. Suspected spam is delivered to be on the safe side, and the user is asked to provide feedback on it.
Mail feedback: (1) Pseudo-feedback fine-tuning of the class template. While the system is running, some highly typical documents are extracted and used to fine-tune the class template, so that it moves closer to the true class template and the accuracy and usability of filtering improve.
The invention proposes a new formula for dynamically modifying the weights:
$$P_i^{new} = \alpha P_i^{old} + \beta I D \qquad (23)$$
where $P_i^{new}$ is the template vector of class $i$ after it has filtered a given document and been modified, $P_i^{old}$ is the template vector of class $i$ before filtering that document, $\alpha$ is the proportion retained from the old template vector, $\beta$ is the modification factor applied to the document vector, and $\alpha + \beta = 1$. The values used here are $\alpha = 0.95$ and $\beta = 0.05$. $D$ is the document vector, and $I$ is a threshold indicator function defined as follows:
$$I = \begin{cases} 0, & \theta_{doc} < \theta \\ 1, & \theta_{doc} \ge \theta \end{cases} \qquad (24)$$
Here $\theta$ is the threshold. All documents are divided into two broad classes. A document is converted into a document vector through word segmentation and feature selection, and is assigned to one of the classes by similarity comparison; the similarity between the document and the vector of that class is $\theta_{doc}$. If $\theta_{doc}$ is greater than or equal to the given threshold $\theta$, the class template vector is modified; otherwise it is left unchanged. To keep unnecessary feedback to a minimum, a suitable threshold must be chosen. Determining the threshold is a fairly complex process and is done empirically through a large number of experiments; the threshold adopted here is $\theta = 0.07$.
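A minimal sketch of the pseudo-feedback update follows, using the values $\alpha = 0.95$, $\beta = 0.05$, and $\theta = 0.07$ stated in the text. Two points are assumptions of this example: the text does not name the similarity measure, so cosine similarity is used, and when the similarity is below the threshold the template is returned unchanged, as the prose describes (a literal reading of Eq. (23) with $I = 0$ would instead scale the old template by $\alpha$). The function names are hypothetical.

```python
import math

ALPHA, BETA, THETA = 0.95, 0.05, 0.07  # values stated in the text


def cosine_similarity(u, v):
    """Similarity between two vectors; cosine is an assumption of this sketch."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def update_template(template, doc, alpha=ALPHA, beta=BETA, theta=THETA):
    """Fine-tune a class template with one document vector (Eqs. 23-24)."""
    if cosine_similarity(template, doc) < theta:
        return list(template)  # I = 0: leave the template unchanged
    # I = 1: blend the old template with the document vector
    return [alpha * p + beta * d for p, d in zip(template, doc)]
```

With $\alpha$ close to 1, a single document moves the template only slightly, so one atypical message cannot shift a class very far.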
(2) Explicit feedback.
The mail system integrates server and client, so explicit feedback is used alongside pseudo-feedback to improve filtering. Through the web interface, users report typical spam. Reported spam is placed in a dedicated feedback training folder, and after a set period the feedback training is run to correct the class template. The feedback training process is as follows. First, unqualified reports are deleted according to certain rules (for example, reported messages that are too short, with only one or two lines). Then, for all the remaining feedback documents, a vector is generated according to the class template, and this vector is used to fine-tune the class template with the pseudo-feedback formula, so that the class template comes closer to the true class template.
Let the initial training set be T and the feedback document set be M. The feedback algorithm is as follows:
Step 1: The training algorithm is run on the training set to generate, for each class, a feature table T_l (l is the number of feature items; the table holds feature words, feature word weights, term frequencies, and document frequencies) and a class-conditional probability table, producing the initial classifier C.
Step 2: A feedback threshold is set and the initial classifier is used to classify the test set. Documents whose similarity exceeds the feedback threshold of a class are collected into the feedback document set of that class.
Step 3: The feedback document set of each class is trained with the improved feature selection method to generate the optimal feature table M_t (t is the number of feature items), and statistics such as the number of documents per class in the feedback set are gathered.
Step 4: Using the per-class table entries generated from the feedback set, a new class feature table P_n (the n highest-weighted features among the l + t feature items, where n is the size of the feature set), the class prior probability table, and the class-conditional probability table of each feature item are regenerated for classifier C, thereby revising C.
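The feature table merge in the final step can be pictured as a top-n selection over the union of the two tables. The sketch below is only one possible reading: the name `merge_feature_tables`, the dictionary representation, and the rule that a word present in both tables keeps its larger weight are all assumptions, since the text does not say how duplicates are resolved.

```python
def merge_feature_tables(initial, feedback, n):
    """Keep the n highest-weight features from the union of two tables.

    initial, feedback : dicts mapping feature word -> weight
    """
    merged = dict(initial)
    for word, weight in feedback.items():
        # Assumption: a word present in both tables keeps its larger weight.
        merged[word] = max(weight, merged.get(word, weight))
    # Sort by descending weight, breaking ties alphabetically for determinism.
    top = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return dict(top)
```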
Although specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that modifications or variations which they can make on the basis of the technical solution of the invention without creative effort remain within its scope of protection.

Claims (5)

1. A multi-level intelligent spam filtering method, characterized in that the filtering steps of the method are as follows:
Step 1: the mail server listens on its ports and determines from the port whether the protocol is SMTP or POP3;
Step 2: if the protocol is SMTP, proceed to step 3; if it is POP3, hand over to the POP3 protocol processing module;
Step 3: enter the SMTP receiving module and extract the relevant information of the mail;
Step 4: first apply black/white list filtering to the extracted mail information; if the mail is on a blacklist, discard it, otherwise proceed to step 5;
Step 5: then filter according to the mail keywords;
Step 6: next evaluate the mail content and act on the result; if the mail is spam, discard it, otherwise proceed to step 7;
Step 7: determine whether the mail is addressed to a local mailbox or to another destination mailbox; if local, perform local mailbox delivery and mail management, otherwise forward the mail.
2. The multi-level intelligent spam filtering method according to claim 1, characterized in that in step 4 the black/white list filtering proceeds as follows: the IP address of the mail is filtered first; if the IP address is on the whitelist, the mail is judged legitimate and accepted; otherwise the IP address is checked against the blacklist, and if it is listed the mail is judged to be spam and discarded; otherwise the DNS address is matched against the DNS whitelist, and if the match succeeds the mail is judged legitimate and accepted; otherwise it is matched against the DNS blacklist, and if the match succeeds the mail is judged to be spam and discarded; otherwise the keywords of the mail subject are matched.
3. The multi-level intelligent spam filtering method according to claim 1, characterized in that in step 6 the mail content is evaluated as follows:
Step 1: extract the body of the mail and perform word segmentation on it;
Step 2: preprocess the segmentation result;
Step 3: perform feature selection on the preprocessed mail;
Step 4: classify the extracted features with a support vector machine (SVM);
Step 5: act on the classification result: legitimate mail is accepted; suspected spam is delivered and the user is asked to provide feedback; spam is discarded.
4. The multi-level intelligent spam filtering method according to claim 3, characterized in that the preprocessing proceeds as follows: first, semantic restoration is applied to the segmentation result, chiefly by using rules to reorganize the segmentation result and extract basic phrases and out-of-vocabulary words; then a stop-word list combined with part-of-speech tagging is used to remove high-frequency and low-frequency words.
5. The multi-level intelligent spam filtering method according to claim 3, characterized in that classification with the support vector machine (SVM) proceeds as follows:
(1) extract the text features of the mail;
(2) calculate the class dependence measure of each feature;
(3) train the support vector machine with the word sequence kernel as its kernel function;
(4) use the class dependence measure to calculate the decay factor of each word;
(5) classify the mail.
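For illustration only, the list-based cascade recited in claim 2 can be sketched as a short decision function. The name `list_filter`, the return values, and the use of Python sets for the four lists are assumptions of this example and form no part of the claims.

```python
def list_filter(ip, domain, ip_white, ip_black, dns_white, dns_black):
    """Return 'accept', 'discard', or 'continue' (go on to subject keyword matching)."""
    if ip in ip_white:
        return "accept"      # IP whitelist hit: legitimate mail
    if ip in ip_black:
        return "discard"     # IP blacklist hit: spam
    if domain in dns_white:
        return "accept"      # DNS whitelist hit: legitimate mail
    if domain in dns_black:
        return "discard"     # DNS blacklist hit: spam
    return "continue"        # no list matched: fall through to keyword filtering
```

In this ordering a whitelist entry always takes precedence over a blacklist entry at the same level, and an IP-level decision takes precedence over a DNS-level one.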
CN201110247504XA 2011-08-24 2011-08-24 Intelligent multilevel junk email filtering method Pending CN102255922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110247504XA CN102255922A (en) 2011-08-24 2011-08-24 Intelligent multilevel junk email filtering method

Publications (1)

Publication Number Publication Date
CN102255922A true CN102255922A (en) 2011-11-23

Family

ID=44982917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110247504XA Pending CN102255922A (en) 2011-08-24 2011-08-24 Intelligent multilevel junk email filtering method

Country Status (1)

Country Link
CN (1) CN102255922A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005101770A1 (en) * 2004-04-05 2005-10-27 Hewlett-Packard Development Company L.P. Junk mail processing device and method thereof
CN101106539A (en) * 2007-08-03 2008-01-16 浙江大学 Filtering method for spam based on supporting vector machine
CN101184259A (en) * 2007-11-01 2008-05-21 浙江大学 Keyword automatically learning and updating method in rubbish short message

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈孝礼: "基于改进SVM的垃圾邮件过滤系统研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609714B (en) * 2011-12-31 2017-07-07 哈尔滨理工大学 Novel classification device and sorting technique based on information gain and Online SVM
CN102609714A (en) * 2011-12-31 2012-07-25 哈尔滨理工大学 Novel classifier based on information gain and online support vector machine, and classification method thereof
CN103246655A (en) * 2012-02-03 2013-08-14 腾讯科技(深圳)有限公司 Text categorizing method, device and system
CN102663291A (en) * 2012-03-23 2012-09-12 奇智软件(北京)有限公司 Information prompting method and information prompting device for e-mails
WO2013139223A1 (en) * 2012-03-23 2013-09-26 北京奇虎科技有限公司 Method and device for prompting information about e-mail
CN104506426B (en) * 2012-03-23 2019-03-01 北京奇虎科技有限公司 The information cuing method and device of mail
CN104506426A (en) * 2012-03-23 2015-04-08 北京奇虎科技有限公司 Information prompting method and device for E-mails
CN103347009A (en) * 2013-06-20 2013-10-09 新浪网技术(中国)有限公司 Method and device filtering information
CN103347009B (en) * 2013-06-20 2016-09-28 新浪网技术(中国)有限公司 A kind of information filtering method and device
CN104394158A (en) * 2014-12-01 2015-03-04 浪潮电子信息产业股份有限公司 Information security filtering method
CN105119910A (en) * 2015-07-23 2015-12-02 浙江大学 Template-based online social network rubbish information real-time detecting method
CN106330680A (en) * 2016-08-30 2017-01-11 黑龙江八农垦大学 Electronic mail cleaning method
CN106453033A (en) * 2016-08-31 2017-02-22 电子科技大学 Multilevel Email classification method based on Email content
CN106453033B (en) * 2016-08-31 2019-03-15 电子科技大学 Multi-level process for sorting mailings based on Mail Contents
CN106453423A (en) * 2016-12-08 2017-02-22 黑龙江大学 Spam filtering system and method based on user personalized setting
CN106453423B (en) * 2016-12-08 2019-10-01 黑龙江大学 A kind of filtration system and method for the spam based on user individual setting
CN107579960A (en) * 2017-08-22 2018-01-12 深圳市盛路物联通讯技术有限公司 A kind of data filtering method and device
CN109428946A (en) * 2017-08-31 2019-03-05 Abb瑞士股份有限公司 Method and system for Data Stream Processing
CN109948033A (en) * 2017-09-04 2019-06-28 北京国双科技有限公司 A kind of vertical field source data filter method and device
CN108683583A (en) * 2018-04-27 2018-10-19 北京顶象技术有限公司 A kind of Junk mail processing method, device and storage medium
CN110149266A (en) * 2018-07-19 2019-08-20 腾讯科技(北京)有限公司 Spam filtering method and device
CN110149266B (en) * 2018-07-19 2022-06-24 腾讯科技(北京)有限公司 Junk mail identification method and device
CN109523241A (en) * 2018-12-13 2019-03-26 杭州安恒信息技术股份有限公司 A kind of E-mail communication method for limiting and system
CN109753973A (en) * 2018-12-21 2019-05-14 西北工业大学 High spectrum image change detecting method based on Weighted Support Vector
CN109800433A (en) * 2019-01-24 2019-05-24 深圳市小满科技有限公司 Method, apparatus of filing, electronic equipment and medium based on two disaggregated model of mail
CN109800433B (en) * 2019-01-24 2023-11-10 深圳市小满科技有限公司 Filing method and device based on mail two-class model, electronic equipment and medium
CN109935287A (en) * 2019-02-28 2019-06-25 生活空间(沈阳)数据技术服务有限公司 A kind of similarity analysis method, device and equipment of medical record information
CN110971619A (en) * 2020-01-02 2020-04-07 惠州学院 Network technology security system and method with bad information filtering processing
CN113159736A (en) * 2021-05-21 2021-07-23 北京天空卫士网络安全技术有限公司 Mailbox management method and device
CN113839950A (en) * 2021-09-27 2021-12-24 厦门天锐科技股份有限公司 Mail approval method and system based on terminal mail SMTP protocol
CN113839950B (en) * 2021-09-27 2023-06-27 厦门天锐科技股份有限公司 Mail approval method and system based on terminal mail SMTP protocol

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20111123