CN102255922A

CN102255922A - Intelligent multilevel junk email filtering method

Info

Publication number: CN102255922A
Application number: CN201110247504XA
Authority: CN
Inventors: 刘培玉; 朱振方; 杨玉珍
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2011-08-24
Filing date: 2011-08-24
Publication date: 2011-11-23

Abstract

The invention discloses an intelligent multilevel junk email filtering method. By the method, a conventional information gain algorithm is improved by utilizing the distribution information of characteristic items and dependence on data in a system training process is reduced, thereby improving the email content analysis capability of a system; a misjudgment rate for normal emails is reduced and the problem of email content semantic information loss is solved; and a weighted-support-vector-machine-based classification method is put forward for the problem of normal email misjudgment in a junk email filtering process, and in the method, two classes of email class weights and weights reflecting the importance of each email are added, and training is performed by utilizing a support vector machine classifier to obtain a junk email filter. In the method, an intelligent multilevel junk email filtering platform is constructed by integrating a plurality of filtering technologies such as Internet protocol (IP) address and domain name system (DNS) blacklists, subject and attachment keyword filtering, email body content filtering, attachment text content filtering and the like.

Description

An intelligent multilevel spam filtering method

Technical field

The invention belongs to the field of information technology. It relates to the classification and filtering of spam, and in particular to an intelligent multilevel spam filtering method.

Background art

With the development and spread of the Internet, email has brought great convenience to people's work and daily life, while unsolicited spam has become a serious nuisance. The flood of spam not only occupies a large amount of bandwidth and wastes network resources; spam is also becoming a vehicle for attacks and a channel for spreading viruses, and so creates serious security risks.

There is still no single, internationally agreed definition of spam. It is usually defined as Unsolicited Bulk Email (UBE) or Unsolicited Commercial Email (UCE), but the same message may be judged differently by different users, and this is precisely why most spam filtering tools on the market perform poorly.

Efforts to curb spam can be broadly divided into three stages:

(1) The first stage identifies spam mainly through IP filtering, blacklists and whitelists, keyword matching and similar techniques.

(2) The second stage identifies spam mainly through intelligent content filtering based on statistical algorithms such as Bayes, together with mechanisms such as RBL filtering.

(3) The third stage grew out of statistics on the sending behavior of spam. This approach first collects, analyzes and computes statistics over a large number of spam samples, and then builds a recognition model of spam-sending behavior according to the RFC 822 protocol. Whether a message is spam can then be decided during the mail transfer agent (MTA) communication phase. This approach speeds up mail filtering and reduces network delay, but filtering based on network behavior features is ineffective against a single spam message.

In general, current spam filtering systems have the following problems:

(1) Misjudgment of normal email

For users, normal email is generally very important; most users would rather read through all their mail than have a single normal message filtered out. For a spam filtering system, therefore, the key concern is precision rather than recall. Yet most current mail filtering systems give too much weight to recall, and most of them filter at the IP or dynamic-IP level, which leads to misjudgment of users' normal email.

(2) Loss of semantic information

Current filtering systems concentrate on IP address filtering and the statistical properties of mail, and neglect mining the semantic information of the mail. Spam is usually disguised as normal email, so its legitimacy can be judged only by parsing its content; in this case IP address filtering and statistical properties alone can hardly give satisfactory results. It is therefore necessary to mine the semantic information of mail in order to improve the accuracy of the filtering system.

(3) Lack of a complete spam filtering solution

As problems (1) and (2) show, a single technique can hardly achieve a satisfactory filtering effect. It is therefore necessary to integrate several techniques, making full use of the strengths of each and avoiding the limitations of any single one. Such a global filtering solution is exactly what current mail filtering systems lack.

Summary of the invention

The purpose of the present invention is to solve the above problems by providing an intelligent multilevel spam filtering method that reduces the misjudgment rate for normal email, solves the loss of semantic information in mail content, and thereby builds a complete spam filtering system from a global perspective.

To achieve this purpose, the present invention adopts the following technical solution:

An intelligent multilevel spam filtering method, whose filtering steps are as follows:

Step 1: the mail server listens on its ports and determines from the port whether the protocol is SMTP or POP3;

Step 2: if it is SMTP, go to Step 3 and continue; if it is POP3, hand over to the POP3 protocol processing module;

Step 3: enter the SMTP receiving module and extract the relevant information of the mail;

Step 4: apply blacklist and whitelist filtering to the extracted mail information first; if the mail is on a blacklist, discard it, otherwise go to Step 5 and continue;

Step 5: then filter by mail keywords;

Step 6: next, judge the mail content and act on the result; if it is spam, discard it, otherwise go to Step 7 and continue;

Step 7: determine whether the destination mailbox is a local mailbox; if so, proceed to local delivery and mail management, otherwise forward the mail.
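Steps 1, 2 and 7 amount to a dispatch on the listening port followed by a routing decision. The following is a minimal Python sketch of that idea, assuming the conventional ports 25 (SMTP) and 110 (POP3). The function names `dispatch` and `route`, the returned module names and the `local_domains` set are hypothetical and do not come from the patent.

```python
SMTP_PORT = 25   # conventional SMTP port (assumption about the deployment)
POP3_PORT = 110  # conventional POP3 port (assumption about the deployment)


def dispatch(port):
    """Steps 1-2: choose the processing module from the listening port."""
    if port == SMTP_PORT:
        return "smtp_receive"
    if port == POP3_PORT:
        return "pop3_process"
    raise ValueError(f"unsupported port: {port}")


def route(recipient, local_domains):
    """Step 7: deliver locally when the recipient's domain is hosted here."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    return "local_delivery" if domain in local_domains else "forward"
```

In this sketch the filtering of Steps 4 to 6 would run between the two calls, and `route` is reached only by mail that has not been discarded.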

In Step 4, blacklist and whitelist filtering proceeds as follows. The IP address of the mail is filtered first: if the IP address is on the whitelist, the mail is judged legitimate and accepted; otherwise the IP address is checked against the blacklist, and if it is listed the mail is judged to be spam and discarded. Otherwise the DNS address is matched against the DNS whitelist; on a match the mail is judged legitimate and accepted. Otherwise it is matched against the DNS blacklist; on a match the mail is judged to be spam and discarded, otherwise the subject keywords of the mail are matched.

In Step 6, the mail content is judged as follows:

Step 1: first extract the message body and segment it into words;

Step 2: preprocess the segmentation result;

Step 3: perform feature selection on the preprocessed mail;

Step 4: classify the extracted features with a support vector machine (SVM);

Step 5: act on the classification result: legitimate mail is accepted; suspected spam is delivered and the user is asked to give feedback; spam is discarded.

The preprocessing proceeds as follows: first, semantic restoration is applied to the segmentation result, mainly by using rules to recombine the segmentation result and extract base phrases and out-of-vocabulary words; then a stop-word list combined with part-of-speech tagging is used to remove high-frequency and low-frequency words.

Classification with the SVM proceeds as follows:

(1) extract the text features of the mail;

(2) compute the class correlation measure of each feature;

(3) train the SVM using the word sequence kernel as the kernel function;

(4) use the class correlation measure to compute the decay factor of each word;

(5) classify the mail.

Beneficial effects of the present invention:

1. The present invention improves the information gain algorithm used in traditional feature selection

Training is mostly based on balanced corpora, but in real environments balanced corpora rarely exist. Spam filtering is in essence a two-class classification problem, so the overall filtering result depends strongly on the balance of the corpus. To address this, the invention uses the distribution information of feature items to improve the traditional information gain algorithm, which reduces the dependence on data during system training and thereby improves the system's ability to analyze mail content.

2. The present invention constructs a text semantic representation model suited to spam filtering

The traditional vector space model assumes that feature items are mutually independent, so it ignores the semantic relations within the information, which makes the filtering process mechanical. Natural language processing techniques are therefore introduced into the vector space model to organize the relations among feature items, so that the connections between the feature words of the filtered text are reflected and filtering accuracy improves.

3. The present invention proposes a spam filtering method based on a weighted support vector machine

The spam filtering method based on a weighted support vector machine is aimed mainly at the misjudgment of normal email during spam filtering. The method adds class weights for the two classes of mail and a weight reflecting the importance of each message, then trains a support vector machine classifier to obtain the spam filter.

4. The present invention proposes a word sequence kernel based on a class correlation measure

Classification with a support vector machine usually ignores text structure and therefore loses a large amount of semantic information. To address this, the invention proposes a word sequence kernel based on a class correlation measure. The implementation steps are as follows:

(1) Extract the text features of the mail.

(2) Compute the class correlation measure of each feature.

(3) Train the SVM using the word sequence kernel as the kernel function.

(4) Use the class correlation measure to compute the decay factor of each word.

(5) Classify the mail.

5. The present invention introduces feedback and self-learning mechanisms into the spam filtering template

Because mail content changes dynamically, the training samples should be updated continuously as the system runs. Because different training samples contribute differently to the filtering system, each sample in the sample space should be given a weight, and the sample weights should be adjusted dynamically according to the filtering effect throughout the filtering process. This retains the samples that contribute most to the system and reduces the interference from samples with low contribution.

6. The present invention finally builds an intelligent multilevel spam filtering platform

The invention integrates several filtering techniques, including IP address and DNS blacklists, keyword filtering of subjects and attachments, message body content filtering and attachment text content filtering, to build an intelligent multilevel spam filtering platform.

Brief description of the drawings

Fig. 1 is a flow chart of the filtering method of the present invention;

Fig. 2 is a flow chart of content-based spam filtering;

Fig. 3 is a flow chart of the feedback process.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and an embodiment.

The invention uses several spam filtering methods, applied in a fixed order, to form a multilevel, integrated filter. Fig. 1 shows the spam filtering flow of the invention. After a message is received, the filtering module filters it in the following order:

(1) First check whether the IP address is on the whitelist. If it is, the mail is judged to be normal. If not, continue with the subsequent filtering steps.

(2) Match the IP address against the blacklist. If it is listed, the mail is spam. Otherwise, continue with the filtering flow.

(3) Match the DNS whitelist. On a match, the mail is judged legitimate and passed to the local delivery or forwarding module. Otherwise, continue with the filtering flow.

(4) Match the DNS blacklist. On a match, the mail is judged to be spam and discarded. Otherwise, continue with the filtering flow.

(5) Match the subject keywords. A match means that the subject contains a prohibited keyword; the mail is spam and is discarded. Otherwise, continue with the filtering flow.

(6) If there are attachments, match the attachment name keywords. A match means that the attachment name contains a prohibited keyword; the mail is judged to be spam and discarded. Otherwise, continue with the filtering flow.

(7) Judge the message body content. If the content is judged to be spam, the mail is still delivered; if the user does not handle it within a certain time, it is treated as spam and deleted.

(8) If there is a text attachment, filter the attachment content. If the content judgment module judges the attachment content to be spam, the mail is handled in the same way as when the body is judged to be spam.
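The eight checks above form an ordered chain in which the first decisive check wins. The following Python sketch shows one way to express that order. The dictionary layout of `mail` and `lists`, the three verdict strings and the `is_spam_text` callable (a stand-in for the content judgment module) are assumptions made for illustration, not structures defined by the patent.

```python
def filter_mail(mail, lists, is_spam_text):
    """Return 'accept', 'discard' or 'quarantine' for one message.

    mail: dict with 'ip', 'domain', 'subject', 'body' and an optional
          'attachments' dict mapping file name -> extracted text.
    lists: dict of sets 'ip_white', 'ip_black', 'dns_white', 'dns_black'
           and 'keywords' (prohibited keywords, lower case).
    is_spam_text: callable standing in for the content judgment module.
    """
    attachments = mail.get("attachments", {})
    if mail["ip"] in lists["ip_white"]:        # (1) IP whitelist
        return "accept"
    if mail["ip"] in lists["ip_black"]:        # (2) IP blacklist
        return "discard"
    if mail["domain"] in lists["dns_white"]:   # (3) DNS whitelist
        return "accept"
    if mail["domain"] in lists["dns_black"]:   # (4) DNS blacklist
        return "discard"
    subject = mail["subject"].lower()
    if any(k in subject for k in lists["keywords"]):   # (5) subject keywords
        return "discard"
    if any(k in name.lower()                           # (6) attachment names
           for name in attachments for k in lists["keywords"]):
        return "discard"
    # (7)-(8) body and text attachments: suspected spam is still delivered
    # and left for the user to confirm within a time limit.
    texts = [mail["body"], *attachments.values()]
    if any(is_spam_text(text) for text in texts):
        return "quarantine"
    return "accept"
```

The cheap list lookups run first, so the comparatively expensive content judgment is reached only by mail that no list or keyword rule has already decided.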

The core of the invention is the content filtering stage shown in Fig. 2. In this stage the message body is first extracted and segmented into words. Because Chinese word segmentation systems are not perfect, part of the semantic information of the mail content is lost after segmentation, so semantic restoration of the segmentation result is necessary.

(1) Preprocessing stage:

The preprocessing of the invention comprises two stages. The first is semantic restoration of the segmentation result; the second is stop-word removal.

(a) Semantic restoration stage

This stage mainly uses rules to recombine the segmentation result and extract base phrases and out-of-vocabulary words. The basic process is as follows:

Base phrase recognition takes segmented and tagged text as input and outputs text with the recognized phrases. Each input feature consists of two parts: a condition, and a rule, i.e. the action to carry out once the condition is satisfied. We therefore define a base phrase recognition condition template and a rule merging template, and finally use maximum entropy to identify the best base phrases.

Chinese is a paratactic language in which word order strongly affects meaning. Chinese is mostly written from left to right, and the head word is usually the last word of a phrase. Phrase recognition therefore proceeds from back to front, i.e. in reverse order, and the invention uses a stack as the data structure for storage.
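The reverse scan with a stack can be sketched as follows. This is only an illustration of the scanning order: the merge rules in `RULES`, the single-letter part-of-speech tags and the `merge_phrases` function are hypothetical, and the sketch applies a plain rule lookup where the patent selects the best phrase with a maximum entropy model.

```python
# Hypothetical merge rules: (POS of left word, POS of head on its right) -> label.
RULES = {("a", "n"): "NP", ("n", "n"): "NP", ("d", "v"): "VP"}
# POS of the head word that each phrase label stands for.
HEAD_POS = {"NP": "n", "VP": "v"}


def merge_phrases(tagged):
    """Scan (word, pos) pairs right to left, merging into the stack top.

    The top of the stack always holds the unit immediately to the right
    of the word being examined, so the head word is seen first.
    """
    stack = []
    for word, pos in reversed(tagged):
        if stack:
            right_text, right_pos = stack[-1]
            head = HEAD_POS.get(right_pos, right_pos)
            label = RULES.get((pos, head))
            if label is not None:
                # Attach the current word to the unit on its right.
                stack[-1] = (word + right_text, label)
                continue
        stack.append((word, pos))
    return list(reversed(stack))
```

Because the head word is pushed first, every word to its left is tested against an already identified head, which is the reason the text gives for scanning from back to front.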

Because the words in a sentence depend on context, information such as the current word, the preceding and following words, parts of speech and the number of syllables of each word must be considered.

According to the factors that affect phrase formation, the feature space is therefore defined as:

1. Part-of-speech information: the parts of speech of the current word and of the two words before and after it;

2. Words: words with a special function, before or after the current word, that affect how the current word forms a phrase, such as certain function words.

3. Tag class: the class to which the current word should belong; we define two classes, the noun phrase class and the verb phrase class.

4. Syllable count: the number of syllables of the current word and of the words before and after it. To avoid data sparsity, phrases are mostly merged from two words; when a three-word phrase is merged, monosyllabic words are the main consideration.

5. Punctuation: certain punctuation marks that affect phrase formation, such as ",".

The recognition conditions are defined from the above feature space. In formulating the base phrase recognition conditions we define the condition templates shown in Table 1.

Table 1 Feature condition templates

When the feature function takes a specific value, the condition template is instantiated into a concrete feature. Part-of-speech tagging follows the "Modern Chinese Corpus Processing - Word Segmentation and Part-of-Speech Tagging Specification" formulated by the Institute of Computational Linguistics of Peking University. For special words that mark boundaries, such as "with", we draw up a boundary word list in advance for recognizing phrase boundaries; to recognize phrase boundaries better, we also draw up a boundary part-of-speech list containing parts of speech such as conjunctions and punctuation.

The instantiated feature condition templates serve as the test conditions for deciding whether the input satisfies a phrase merging rule (some merging rules are shown in Table 2). If it does, the phrase is merged; otherwise the next test is carried out. The whole matching process is thus converted into a binary classification process, and each feature can be expressed as a binary feature function. For example, the binary feature function for the first rule in Table 2 is:

Table 2 Selected phrase merging rules

(b) Stop-word removal

For stop-word removal, the invention uses a stop-word list combined with part-of-speech tagging to remove high-frequency words, low-frequency words and stop words.
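A minimal Python sketch of this step follows. The stop-word list, the set of dropped part-of-speech tags and the two document-frequency thresholds are hypothetical values chosen for illustration; the patent does not specify them.

```python
STOP_WORDS = {"的", "了", "是"}    # hypothetical stop-word list
DROP_POS = {"u", "c", "p", "w"}     # hypothetical tags: auxiliary, conjunction,
                                    # preposition, punctuation


def remove_stop_words(tagged, doc_freq, n_docs, min_df=2, max_df_ratio=0.9):
    """Keep the words of `tagged` ((word, pos) pairs) that survive all filters.

    doc_freq maps a word to the number of training documents containing it.
    """
    kept = []
    for word, pos in tagged:
        if word in STOP_WORDS or pos in DROP_POS:
            continue  # listed stop word or function-word tag
        df = doc_freq.get(word, 0)
        if df < min_df or df > max_df_ratio * n_docs:
            continue  # too rare or too common to discriminate
        kept.append(word)
    return kept
```

Combining the list with the tags lets function words that are missing from the list still be removed by their part of speech.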

(2) Feature selection

The purpose of feature selection is to continue feature extraction on the result of the stop-word removal step, extracting the feature items that represent the characteristics of the mail and removing those items that have no discriminating power and only interfere with mail analysis. The invention mainly uses the information gain algorithm for feature selection. In real experimental environments, however, corpora are often heavily imbalanced, and mail filtering is a two-class problem that depends strongly on the balance of the corpus, so the invention improves the information gain algorithm as follows:

(a) Inter-class dispersion of a feature item

From statistics, the mean is the simple average, and the variance measures how far individual values deviate from the mean; together the mean and the variance characterize the distribution of items across the corpus and its documents. Formula (2) can be used to estimate the variance:

S^{2} = \frac{\sum_{i=1}^{n} (d_{i} - \bar{d})^{2}}{n - 1} \qquad (2)

where n is the total number of classes in the corpus, \bar{d} is the mean of feature item t over the classes, and d_i is the total frequency of feature item t in class i.

Here we can replace d_i in formula (2) with the term frequency tf_i of the feature item, and use the average term frequency \overline{tf}(t) to improve the correlation computation. The inter-class dispersion of a feature item (Distribution Information Among a Class, DIac) can then be expressed as formula (3):

DIac = \frac{\sqrt{\sum_{i=1}^{n} (tf_{i}(t) - \overline{tf}(t))^{2}} \,/\, (n - 1)}{tf(t)} \qquad (3)

We use DIac = S to represent the distribution of item t among the classes. When item t occurs in all texts of only one class, DIac reaches its maximum; in this case E(t) and its companion term are 0 and IG(D, t) takes its maximum, so the discriminating power of t is strongest. When the TF of t is the same in every class, DIac reaches its minimum and the discriminating power of t is weakest.

(b) Intra-class dispersion of a feature item

Similarly, the variance can be used to describe the distribution of feature item t within a class, called the intra-class dispersion DIic (Distribution Information Inside a Class). It is computed as in formula (4):

DIic = \frac{\sqrt{\sum_{j=1}^{m} (tf_{j}(t) - \overline{tf'}(t))^{2}} \,/\, (m - 1)}{tf'(t)} \qquad (4)

where tf_j(t) is the frequency with which t occurs in the j-th document, m is the total number of documents in the class, \overline{tf'}(t) is the mean frequency of item t per document, and tf'(t) is the total frequency of t over the documents. From formula (4), when t occurs in all documents of the class, DIic reaches its minimum and the discriminating power of feature item t is strongest; the DIic value is thus inversely related to classification ability.
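Formulas (3) and (4) share one shape, so a single helper can serve both. The Python sketch below assumes the grouping square root of the summed squared deviations, divided by (count - 1), divided by the total frequency, which is how the formulas read as printed. The function name `dispersion` and the convention of returning 0 for degenerate input are assumptions made for this sketch.

```python
import math


def dispersion(freqs):
    """Dispersion of a list of term frequencies, shaped like (3) and (4).

    Pass per-class frequencies of a term for DIac, or per-document
    frequencies within one class for DIic.
    """
    n = len(freqs)
    total = sum(freqs)
    if n < 2 or total == 0:
        return 0.0  # assumed convention for degenerate input
    mean = total / n
    spread = math.sqrt(sum((f - mean) ** 2 for f in freqs))
    return spread / (n - 1) / total
```

Equal frequencies give 0, and the value grows as the occurrences concentrate in fewer classes or documents.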

(c) Improved information gain algorithm

From the above analysis, the value of the intra-class dispersion DIic reveals whether a feature item is balanced within the class. The larger the DIic value, the more unevenly the item is distributed within the class; when t occurs in only one document of a class, DIic reaches its maximum of 1 and the distribution of t within the class is highly imbalanced. The inter-class dispersion DIac, in turn, reveals the class distribution of a feature item: the more uneven the class distribution of the feature item, the smaller the inter-class dispersion.

Suppose the distribution of a feature item is highly imbalanced both across classes and across items, that is, feature item t occurs heavily in only a few documents of one class. The information gain of t with respect to its class is then still fairly large and the feature would be selected, which is not the desired result. In this case DIic is usually large while DIac is small. We therefore use DIic > DIac as the test condition and conditionally multiply the third part of the information gain formula by a penalty term DIac, to offset the negative effect of the case where the feature does not occur. This lowers the probability of selecting features that occur rarely in one class and more often in other classes. The information gain with the penalty factor (Gain Distribution Information, GDI) is:

GDI(D, t) = -\sum_{i=1}^{m} P(C_{i}) \log P(C_{i}) + P(t) \sum_{i=1}^{m} P(C_{i} | t) \log P(C_{i} | t) + P(\bar{t}) \sum_{i=1}^{m} P(C_{i} | \bar{t}) \log P(C_{i} | \bar{t}) \times DIac \qquad (5)

Formula (5) penalizes the case where the feature item does not occur when features are imbalanced, but it is not suitable when the feature distribution is fairly balanced. DIic and DIac are therefore used as the test condition to combine the traditional information gain algorithm (IG) with GDI into a new feature selection algorithm, IG-GDI, which overcomes the defect of traditional information gain while keeping its advantages. The algorithm flow is as follows:

Input: v, the number of all features

Output: F, the selected feature subset

Delete low-frequency words using document frequency (DF)

Compute DIic for every feature item

Compute DIac

If DIic > DIac

    Compute GDI by formula (5)

Otherwise

    Compute IG

If the IG or GDI value is large

    Add t to the set F
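The flow above can be sketched in Python as follows. Several details are assumptions made for this sketch rather than statements of the patent: probabilities are estimated from document counts, DIic is computed in the class where the term is most frequent, "large" is read as "among the top k scores", and the dispersion helper uses the grouping of formulas (3) and (4) as printed.

```python
import math
from collections import Counter


def dispersion(freqs):
    """Shape of formulas (3)/(4): sqrt(sum of squared deviations)/(n-1)/total."""
    n, total = len(freqs), sum(freqs)
    if n < 2 or total == 0:
        return 0.0
    mean = total / n
    return math.sqrt(sum((f - mean) ** 2 for f in freqs)) / (n - 1) / total


def plogp_sum(labels):
    """sum_i P(C_i) log P(C_i) over a list of class labels (0 if empty)."""
    total = len(labels)
    return sum(c / total * math.log(c / total)
               for c in Counter(labels).values()) if total else 0.0


def score_feature(term, docs):
    """IG or GDI score of `term`; docs is a list of (label, Counter) pairs."""
    classes = sorted({label for label, _ in docs})
    with_t = [label for label, tf in docs if tf[term] > 0]
    without_t = [label for label, tf in docs if tf[term] == 0]
    prior = -plogp_sum([label for label, _ in docs])
    present = len(with_t) / len(docs) * plogp_sum(with_t)
    absent = len(without_t) / len(docs) * plogp_sum(without_t)
    class_tf = [sum(tf[term] for label, tf in docs if label == c) for c in classes]
    di_ac = dispersion(class_tf)
    main = classes[class_tf.index(max(class_tf))]   # assumed choice of class
    di_ic = dispersion([tf[term] for label, tf in docs if label == main])
    if di_ic > di_ac:
        return prior + present + absent * di_ac     # GDI, formula (5)
    return prior + present + absent                 # traditional IG


def select_features(docs, k, min_df=2):
    """Drop terms with document frequency below min_df, keep the k best."""
    df = Counter(term for _, tf in docs for term in tf)
    candidates = [t for t, n in df.items() if n >= min_df]
    return sorted(candidates, key=lambda t: (-score_feature(t, docs), t))[:k]
```

Each document is passed as a `(label, Counter)` pair so that a missing term counts as frequency 0 without extra checks.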

(3) Classification with the SVM

(a) Weighted support vector machine filtering model

The weighted support vector machine (Weighted Support Vector Machines, WSVM) is generally used to solve classification problems with imbalanced samples. The invention holds that, besides a disparity in the number of samples per class, a difference in the importance of the classes also makes the samples imbalanced. In spam filtering, normal email is clearly more important than spam. While maintaining classification accuracy, misjudgment of normal email should be avoided as far as possible, so mail filtering is in essence also a classification problem with imbalanced samples.

Let the mail training sample set be:

(\vec{x}_{1}, y_{1}), (\vec{x}_{2}, y_{2}), \ldots, (\vec{x}_{l}, y_{l}), \quad \vec{x}_{i} \in R^{n}, \; y_{i} \in \{-1, +1\} \qquad (6)

where \vec{x}_i is the vector representing the i-th message and y_i is the class label: y_i = 1 means that the i-th message is normal email, and y_i = -1 means that it is spam. The spam filtering model based on the weighted support vector machine is:

\min \frac{1}{2} \|\vec{w}\|^{2} + C\sigma \sum_{i=1}^{l} s_{i} \xi_{i} \qquad (7)

s.t. \quad y_{i} (\vec{w}^{T} \Phi(\vec{x}_{i}) + b) \geq 1 - \xi_{i}

where ξ_i ≥ 0, i = 1, 2, ..., l, and K(·,·) is the kernel function. SVM classifiers based on the radial basis kernel filter spam well, and the radial basis kernel is a commonly used kernel function, so the radial basis kernel function is chosen here.

s_i > 0 is the importance weight of a sample: 0 < s_i < 1 means that sample \vec{x}_i is unimportant, s_i = 1 means that it is of ordinary importance, and s_i > 1 means that it is important. The class weight is σ ≥ 1, and samples of the same class have the same class weight. Compared with the standard SVM, the main advantage of the weighted SVM is that it makes the penalty for misclassified samples fuzzy: the slack variable of each sample is multiplied by the importance weight and the class weight of that sample.
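To make the role of the two weights concrete, the Python sketch below evaluates objective (7) for a given separating hyperplane, taking each slack variable at its smallest feasible value, max(0, 1 - y_i(w·x_i + b)). For brevity it assumes a linear feature map Φ(x) = x instead of the radial basis kernel chosen in the text; the function name and the `sigma` dictionary, which maps each class label to its class weight, are assumptions of this sketch.

```python
def wsvm_objective(w, b, X, y, C, sigma, s):
    """Value of objective (7) for a linear feature map (an assumption).

    sigma: dict mapping class label (+1 or -1) to its class weight.
    s:     per-sample importance weights s_i.
    """
    norm_term = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for xi, yi, si in zip(X, y, s):
        margin = yi * (sum(wi * xij for wi, xij in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)        # smallest feasible xi_i
        penalty += C * sigma[yi] * si * slack  # weighted misclassification cost
    return norm_term + penalty
```

For example, with `sigma = {1: 2.0, -1: 1.0}` a margin violation on a normal message costs twice as much as the same violation on a spam message, which is how the model discourages misjudging normal email.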

The Lagrange function for formula (7) is constructed as follows:

\mathcal{L}(\vec{w}, b, \alpha) = \frac{1}{2} \|\vec{w}\|^{2} + C\sigma \sum_{i=1}^{l} s_{i} \xi_{i} - \sum_{i=1}^{l} \alpha_{i} (y_{i} (\vec{w}^{T} \Phi(\vec{x}_{i}) + b) - 1 + \xi_{i}) - \sum_{i=1}^{l} \beta_{i} \xi_{i} \qquad (8)

where α_i and β_i are the Lagrange multipliers. Setting

\frac{\partial \mathcal{L}}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{l} \alpha_{i} y_{i} \Phi(\vec{x}_{i}) = 0 \qquad (9)

\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{l} \alpha_{i} y_{i} = 0 \qquad (10)

\frac{\partial \mathcal{L}}{\partial \xi_{i}} = \sigma s_{i} C - \alpha_{i} - \beta_{i} = 0 \qquad (11)

and substituting formulas (9)-(11) into the Lagrange function, the dual of the optimization problem of this weighted SVM is:

\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_{i} \alpha_{j} y_{i} y_{j} K(\vec{x}_{i}, \vec{x}_{j}) + \sum_{i=1}^{l} \alpha_{i}

s.t. \quad \sum_{i=1}^{l} \alpha_{i} y_{i} = 0

0 \leq \alpha_{i} \leq \sigma C s_{i}, \quad i = 1, 2, \ldots, l \qquad (12)

Solving the quadratic program (12) gives the optimal Lagrange multipliers α_i^{*}; substituting them yields:

\vec{w}^{*} = \sum_{i=1}^{l} \alpha_{i}^{*} y_{i} \Phi(\vec{x}_{i}) \qquad (13)

b^{*} = y_{j} - \sum_{i=1}^{l} y_{i} \alpha_{i}^{*} K(\vec{x}_{i}, \vec{x}_{j}) \qquad (14)

Finally the optimal classifier is obtained:

f(\vec{x}_{j}) = sgn(\sum_{i=1}^{l} y_{i} \alpha_{i}^{*} K(\vec{x}_{i}, \vec{x}_{j}) + b^{*}) \qquad (15)

The upper bound of each Lagrange multiplier, σ C s_i in (12), thus varies with the importance of the sample and with the class to which it belongs.
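Once the multipliers and the bias are known, the classifier (15) is a weighted sum of kernel values. The Python sketch below evaluates it with the radial basis kernel in its usual form exp(-γ‖x - z‖²). The value of `gamma`, the function names and the choice of returning +1 when the sum is exactly 0 are assumptions of this sketch; the multipliers in the usage are made up rather than obtained by solving (12).

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - z||^2); gamma is assumed."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def classify(x, train_x, train_y, alpha, b, kernel=rbf_kernel):
    """Decision function (15): +1 for normal email, -1 for spam."""
    score = sum(a * yi * kernel(xi, x)
                for a, yi, xi in zip(alpha, train_y, train_x)) + b
    return 1 if score >= 0 else -1   # tie at 0 resolved as +1 (assumption)
```

Only samples with a non-zero multiplier contribute to the sum, so in practice the loop runs over the support vectors alone.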

(b) Word sequence kernel based on class correlation

Current research on spam filtering commonly ignores the structure of the mail text and so loses a large amount of semantic information, which makes it hard for filtering algorithms to extract the semantic information that best reflects the nature of the mail.

To address this, the invention proposes a word sequence kernel based on a class correlation measure and applies it to SVM spam filtering. The text features of the mail are extracted first and the class correlation measure of each feature is computed; the SVM is then trained with the word sequence kernel as the kernel function, with the class correlation measure used during training to compute the decay factor of each word; finally the mail is classified.

1. Word sequence kernel

Let Σ denote a finite set of words and s = s_1 s_2 ... s_{|s|} denote any word sequence, where the length of s is |s|. Let i = [i_1, ..., i_n] denote a set of indices into the word sequence s, with 1 ≤ i_1 < ... < i_n ≤ |s|, and let s[i] = s_{i_1} s_{i_2} ... s_{i_n} denote the corresponding subsequence of s; consecutive words s_{i_k} and s_{i_{k+1}} of this subsequence need not be adjacent in s. Clearly s[i] ∈ Σ^n. For a subsequence μ = μ_1 ... μ_{|μ|} of length |μ|, suppose that s[i] is found in the word sequence s such that μ = s[i], that is, μ_k = s_{i_k} for k = 1, ..., |μ|.

Let l(i) denote the length spanned by the subsequence μ in s; then l(i) = i_{|μ|} - i_1 + 1, where i_{|μ|} is the index corresponding to μ_{|μ|}. The word sequence kernel between two word sequences s and t is then:

K_{wn}(s, t) = \sum_{\mu \in \Sigma^{n}} \sum_{i : \mu = s[i]} \sum_{j : \mu = t[j]} \prod_{1 \leq r \leq |\mu|} \lambda_{m, \mu_{r}}^{2} \prod_{i_{1} \leq k \leq i_{|\mu|}, k \notin i} \lambda_{g, s_{k}} \prod_{j_{1} \leq l \leq j_{|\mu|}, l \notin j} \lambda_{g, t_{l}} \qquad (16)

where λ is the decay coefficient that penalizes non-contiguous subsequences, and λ_{m,x} and λ_{g,x} are two decay coefficients set for the same word x: λ_{m,x} is the decay coefficient when x is a match point and λ_{g,x} is the decay coefficient when x is a gap point; the two are not necessarily related. For example, setting λ_m = 1 when computing the kernel means that words serving as match points are not penalized; setting λ_g = 0 means that non-contiguous sequences are not allowed in matching.
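Formula (16) can be checked on short sequences by brute-force enumeration of all index sets. The Python sketch below does exactly that; it is meant only to make the roles of the match and gap coefficients concrete and is far too slow for real mail, where a dynamic-programming formulation would be needed. The function name and the representation of λ_m and λ_g as callables from a word to a coefficient are assumptions of this sketch, and n is assumed to be at least 1.

```python
from itertools import combinations
from math import prod


def word_sequence_kernel(s, t, n, lam_m, lam_g):
    """Brute-force value of formula (16) for word lists s and t.

    lam_m(word): decay coefficient when the word is a match point.
    lam_g(word): decay coefficient when the word is a gap point.
    """
    def weighted_subsequences(seq):
        # Every strictly increasing index set of size n, together with the
        # product of gap coefficients for the skipped words inside its span.
        for idx in combinations(range(len(seq)), n):
            chosen = set(idx)
            gap = prod(lam_g(seq[k])
                       for k in range(idx[0], idx[-1] + 1) if k not in chosen)
            yield tuple(seq[k] for k in idx), gap

    t_subs = list(weighted_subsequences(t))
    total = 0.0
    for mu, gap_s in weighted_subsequences(s):
        match = prod(lam_m(w) ** 2 for w in mu)
        total += sum(match * gap_s * gap_t
                     for nu, gap_t in t_subs if nu == mu)
    return total
```

With `lam_g` returning 0 for every word, any subsequence that skips a word contributes nothing, which reproduces the statement that λ_g = 0 forbids non-contiguous matches.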

2. Introducing the class dependence measure

The penalty applied to each word should depend on its ability to discriminate between classes. A word with strong discriminating ability should have a large attenuation coefficient as a match point and a relatively small one as a gap point, whereas a word with weak discriminating ability should play a roughly balanced role whether it is a match point or a gap point. For this purpose, a class dependence measure (Dependence Measure, DP) is introduced as the basis for computing the attenuation coefficients. The specific computation is given below.

Let L denote the set of normal emails, S the set of spam emails, and t_k a feature word. Then

DP(t_k) = (p(t_k | L) - p(t_k | S))^{2} \qquad (17)

is defined as the class dependence measure of t_k.

Here p(t_k | L) and p(t_k | S) denote the probability that t_k occurs given class L or S, respectively. DP(t_k) attains its minimum value 0 if and only if t_k occurs with equal probability in the two classes of text, and attains its maximum value 1 if and only if t_k occurs with probability 1 in the text of one class and never in the other.

The match-point attenuation coefficient λ_{m, t_k} of t_k is directly proportional to its class dependence measure. As a function of the variable p(t_k | L) - p(t_k | S), DP(t_k) is an upward-opening parabola: when |p(t_k | L) - p(t_k | S)| lies in an ε-neighborhood of 0, DP(t_k) is very small and changes only slowly, which is unfavorable for computing the kernel similarity. Therefore, when the class dependence measure DP(t_k) is introduced into the kernel function, the following formula is used:

\lambda_{m, t_{k}} = \sqrt{| p(t_{k} | L) - p(t_{k} | S) |} \qquad (18)

In the computation, these probabilities can be estimated by substituting frequencies for probabilities:

p(t_{k} | c) = \frac{\#(t_{k} \text{ in } c)}{\#(* \text{ in } c)}, \quad c \in \{L, S\} \qquad (19)

Here #(t_k in c) denotes the number of times t_k occurs in c, #(* in c) denotes the number of occurrences of all words in c, and c stands for L or S. The gap-point attenuation coefficient can then be taken as:

\lambda_{g, t_{k}} = 1 - \sqrt{| p(t_{k} | L) - p(t_{k} | S) |} \qquad (20)
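Formulas (17) to (20) can be illustrated with a short Python sketch that estimates the probabilities from word counts and derives both attenuation coefficients for every word. The function name `attenuation_coefficients`, the dictionary keys, and the representation of mails as lists of tokens are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def attenuation_coefficients(normal_docs, spam_docs):
    """Estimate DP, lambda_m and lambda_g for every word from raw counts.

    normal_docs, spam_docs -- lists of tokenized mails (lists of words)
    """
    count_l = Counter(w for doc in normal_docs for w in doc)
    count_s = Counter(w for doc in spam_docs for w in doc)
    total_l = sum(count_l.values())  # #(* in L)
    total_s = sum(count_s.values())  # #(* in S)

    coeffs = {}
    for word in set(count_l) | set(count_s):
        # Formula (19): frequency estimate of p(t_k | c).
        p_l = count_l[word] / total_l if total_l else 0.0
        p_s = count_s[word] / total_s if total_s else 0.0
        diff = abs(p_l - p_s)
        coeffs[word] = {
            "dp": diff ** 2,            # formula (17)
            "lam_m": sqrt(diff),        # formula (18)
            "lam_g": 1.0 - sqrt(diff),  # formula (20)
        }
    return coeffs
```

The square root in formulas (18) and (20) stretches small probability differences: a difference of 0.25 gives DP = 0.0625 but λ_m = 0.5, which is the behavior the text asks for near 0.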

In addition, to avoid the very small similarity values caused by sparsity, the linearity of kernel functions is used to make the following adjustment:

K_{n}(s, t) = \sum_{i = 1}^{n} \mu^{1 - i} \hat{K}_{i}(s, t) \qquad (21)

Here powers of μ serve as the weighting coefficients (this μ is a scalar weighting parameter, distinct from the subsequence μ in formula (16)), and \hat{K}_{i} is the normalized word sequence kernel for subsequences of length i:

\hat{K}_{i}(s, t) = \frac{K_{i}(s, t)}{\sqrt{K_{i}(s, s) K_{i}(t, t)}} \qquad (22)
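The normalization and weighting in formulas (21) and (22) can be sketched in Python as follows. To keep the example self-contained, a simple contiguous n-gram kernel stands in for K_i; that stand-in, the function names, and the choice to skip lengths where the normalization is undefined are all assumptions of this sketch and not part of the described method.

```python
from collections import Counter
from math import sqrt


def ngram_kernel(s, t, i):
    """Stand-in base kernel (assumption): counts shared contiguous i-grams."""
    grams_s = Counter(tuple(s[k:k + i]) for k in range(len(s) - i + 1))
    grams_t = Counter(tuple(t[k:k + i]) for k in range(len(t) - i + 1))
    return float(sum(c * grams_t[g] for g, c in grams_s.items()))


def combined_kernel(s, t, n, mu, base_kernel=ngram_kernel):
    """Formula (21): weighted sum of normalized kernels for lengths 1..n."""
    total = 0.0
    for i in range(1, n + 1):
        k_ss = base_kernel(s, s, i)
        k_tt = base_kernel(t, t, i)
        if k_ss == 0 or k_tt == 0:
            # Normalization is undefined here; skipping is an assumption.
            continue
        k_hat = base_kernel(s, t, i) / sqrt(k_ss * k_tt)  # formula (22)
        total += mu ** (1 - i) * k_hat
    return total
```

Because each normalized kernel lies in a comparable range, short and long subsequences can be combined without the longer, sparser matches vanishing. With μ greater than 1, the weight μ^{1-i} decreases as the subsequence length i grows.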

The above steps classify the mail. The present invention divides mail into three categories: normal mail, spam, and suspected spam. Normal mail is delivered as usual and spam is filtered out directly. Suspected spam is delivered to be on the safe side, and user feedback on the suspected spam is then required.

Mail feedback: (1) Pseudo-feedback fine-tuning of the class templates. While the system is running, some highly typical documents are extracted and used to fine-tune the class templates. This brings the class templates closer to the true class templates and improves the accuracy and usability of filtering.

The present invention proposes a new formula for dynamically modifying the weights:

P_{i}^{new} = \alpha P_{i}^{old} + \beta I D \qquad (23)

Here P_{i}^{new} is the template vector of class i after a document has been filtered and the template modified, and P_{i}^{old} is the template vector of class i before the document was filtered. α is the proportion of the old template vector retained in the modification, and β is the modification factor applied to the document vector when revising the old template, with α + β = 1; the values used here are α = 0.95 and β = 0.05. D is the document vector, and I is a linear threshold function defined as follows:

I = \begin{cases} 0, & \theta_{doc} < \theta \\ 1, & \theta_{doc} \ge \theta \end{cases} \qquad (24)

θ is a threshold. All documents are divided into two classes. A document is converted into a document vector through word segmentation and feature selection and is assigned to one of the classes by similarity comparison; the similarity between the document and the vector of that class is θ_doc. If θ_doc is greater than or equal to the given threshold θ, the class template vector is modified; otherwise it is not. To keep unnecessary feedback to a minimum, a suitable threshold must be chosen. Determining the threshold is a fairly complex process that relies on a large number of experiments and on empirical values. Based on such experiments, the threshold used here is θ = 0.07.
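The update rule of formulas (23) and (24) can be sketched in Python as below. The function name `update_template` and the use of plain lists for vectors are assumptions. One point of interpretation: read literally, formula (23) with I = 0 would scale the old template by α, while the prose says the template is not modified; this sketch follows the prose and returns the template unchanged in that case.

```python
def update_template(p_old, doc_vec, theta_doc, theta=0.07, alpha=0.95, beta=0.05):
    """Pseudo-feedback update of a class template vector.

    p_old     -- current template vector of the class (list of floats)
    doc_vec   -- document vector D, same length as p_old
    theta_doc -- similarity between the document and the class template
    """
    # Formula (24): the threshold function I.
    indicator = 1 if theta_doc >= theta else 0
    if indicator == 0:
        # Below the threshold the template is left as it is (follows the prose).
        return list(p_old)
    # Formula (23) with I = 1.
    return [alpha * p + beta * d for p, d in zip(p_old, doc_vec)]
```

With the default α = 0.95 and β = 0.05, each qualifying document moves the template only slightly toward the document vector, so a single atypical mail cannot dominate the template.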

(2) Explicit feedback.

The mail system integrates the server and the client, so explicit feedback is used alongside pseudo-feedback to improve filtering. In the web interface, users report particularly typical spam, and the reported spam is placed in a dedicated feedback training folder. After a set period of time, feedback training is started to correct the class templates. The feedback training process is as follows. First, unqualified reported mails are deleted according to certain rules (for example, reported mails that are too short, containing only one or two lines). Then, for all remaining feedback documents, a vector is generated according to the class template. This vector is used to fine-tune the class templates according to the pseudo-feedback formula, bringing them closer to the true class templates.

Let the initial training set be T and the feedback document set be M. The feedback algorithm is described as follows:

Step 1: Run the training algorithm on the training set to generate the feature vocabulary T_l of each class (l is the number of feature items; the table contains the feature words, feature word weights, term frequencies, and document frequencies) and the class-conditional probability table, and generate the initial classifier C.

Step 2: Set the feedback threshold and classify the test set with the initial classifier. Documents whose similarity exceeds the feedback threshold of a class are collected into the feedback document set of that class.

Step 3: Train on the feedback document set of each class with the improved feature selection method to generate the optimal feature vocabulary M_t (t is the number of feature items), and collect statistics such as the number of documents of each class in the feedback set.

Step 4: Using the table entries generated from the feedback set for each class, regenerate for classifier C the new class feature vocabulary P_n (n is the number of feature items, namely the n features with the largest weights among the l + t feature items), the class prior probability table, and the class-conditional probability table of each feature item, thereby revising classifier C.
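The vocabulary merge in Step 4 can be sketched in Python as follows: the feature items from training and from feedback are pooled and the n highest-weighted ones are kept. The function name `merge_vocabularies`, the representation of a vocabulary as a word-to-weight dictionary, and the rule of keeping the larger weight when a word appears in both vocabularies are assumptions of this sketch; the source does not specify how duplicates are resolved.

```python
def merge_vocabularies(train_vocab, feedback_vocab, n):
    """Keep the n highest-weighted features from the pooled vocabularies.

    train_vocab, feedback_vocab -- dicts mapping feature word -> weight
    """
    merged = dict(train_vocab)
    for word, weight in feedback_vocab.items():
        # Duplicate words keep the larger weight (assumption).
        merged[word] = max(weight, merged.get(word, float("-inf")))
    # Sort by descending weight; ties are broken alphabetically for determinism.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])
```

After this step, the prior and class-conditional probability tables would be recomputed over the retained features, as the text describes.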

Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that, on the basis of the technical solution of the present invention, various modifications or variations that can be made without creative effort remain within the scope of protection of the invention.

Claims

1. A multi-level intelligent spam filtering method, characterized in that the filtering steps of the method are as follows:

Step5: then filter according to the mail keywords;

2. The multi-level intelligent spam filtering method according to claim 1, characterized in that, in said step4, the black and white list filtering proceeds as follows: first, the IP address of the mail is preliminarily filtered; if the IP address is in the whitelist, the mail is judged to be legitimate and is accepted; otherwise, it is determined whether the IP address is in the blacklist; if so, the mail is judged to be spam and is discarded; otherwise, the DNS address is matched against the DNS whitelist; if the match succeeds, the mail is judged to be legitimate and is accepted; otherwise, the DNS address is matched against the DNS blacklist; if the match succeeds, the mail is judged to be spam and is discarded; otherwise, the mail subject keywords are matched.

3. The multi-level intelligent spam filtering method according to claim 1, characterized in that, in said step6, the mail content is judged as follows:

Step 1: first extract the mail body and segment it into words;

Step 2: preprocess the word segmentation result;

Step 3: perform feature selection on the preprocessed mail;

Step 4: classify the extracted features with a support vector machine (SVM);

4. The multi-level intelligent spam filtering method according to claim 3, characterized in that the preprocessing proceeds as follows: first, semantic restoration is performed on the segmentation result, mainly by reorganizing the segmentation result with rule-based methods to extract basic phrases and out-of-vocabulary words; then, high-frequency and low-frequency words are removed using a stop-word list combined with part-of-speech tagging.

5. The multi-level intelligent spam filtering method according to claim 3, characterized in that the classification with the support vector machine (SVM) proceeds as follows:

(1) extract the mail text features;

(2) compute the class dependence measure of each feature;

(3) train the support vector machine using the word sequence kernel as the kernel function;

(5) classify the mail.
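For illustration only, the black and white list cascade recited in claim 2 can be sketched in Python as follows. The function name `list_filter`, the use of sets for the lists, the example addresses, and the returned labels are hypothetical; the sketch shows only the order of the checks and does not form part of the claims.

```python
def list_filter(ip, dns_name, ip_white, ip_black, dns_white, dns_black):
    """Apply the IP and DNS list checks in the order given in claim 2."""
    if ip in ip_white:
        return "accept"    # legitimate mail, received
    if ip in ip_black:
        return "discard"   # spam, dropped
    if dns_name in dns_white:
        return "accept"
    if dns_name in dns_black:
        return "discard"
    # No list matched: continue with subject keyword matching.
    return "check_subject_keywords"


# Example call with made-up addresses from documentation ranges.
decision = list_filter(
    "192.0.2.10", "mail.example.org",
    ip_white={"198.51.100.1"}, ip_black={"203.0.113.5"},
    dns_white={"mail.example.org"}, dns_black={"spam.example.net"},
)
```

In this ordering the whitelist checks come before the corresponding blacklist checks, so an address present on both lists is accepted.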