CN101494546B

CN101494546B - Method for preventing collaboration type junk mail

Info

Publication number: CN101494546B
Application number: CN2009100286953A
Authority: CN
Inventors: 曹玖新; 罗军舟; 林加镇; 姚燚; 刘永生; 孙学胜
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2009-01-05
Filing date: 2009-01-05
Publication date: 2011-04-20
Anticipated expiration: 2029-01-05
Also published as: CN101494546A

Abstract

A cooperative spam prevention method that addresses e-mail security problems on the Internet and draws on the honeypot principle, spam filtering, and related techniques. The method uses a distributed network structure comprising e-mail servers and a directory server. In this structure, each e-mail server is the main component that filters spam, and it also collects spam samples. To improve the timeliness of sample collection, the e-mail server extracts the features of the collected samples, performs an initial judgment (first-level judgment), and submits the 'possible spam' features to the directory server. The directory server performs a second filtering (second-level judgment) on the received 'possible spam' features, generates spam filtering rules, stores them in a rule base, and then releases the updated rules to the local rule base of each e-mail server through a unified interface, thereby sharing and updating the filtering rules and achieving cooperative spam prevention.

Description

Method for preventing collaboration type junk mail

Technical field

The present invention draws on the honeypot principle, spam filtering, and related techniques, and relates to e-mail security within the field of network security.

Background technology

In recent years, the ways in which spam spreads and the content it carries have taken new forms, and the harm it causes has grown steadily more serious: it occupies a large amount of network resources and reduces network efficiency; it wastes a great deal of mail users' time; it has become a main channel for spreading viruses, Trojan horses, and phishing, seriously threatening network security; in addition, spam spreads harmful information and has a serious negative effect on society. Existing spam filtering techniques cannot cope well with this situation. Further improving the Internet's ability to resist spam, and better meeting users' anti-spam needs, has become an urgent task in the field of network security.

The spam filtering techniques in general use at present are mainly:

First, blacklist and whitelist filtering. This technique requires users to maintain blacklists and whitelists manually, and spam is filtered on that basis.

Second, statistics-based filtering. Existing statistics-based techniques, for example the support vector machine (SVM) method and the Bayesian method, learn and generalize the statistical regularities of samples and then identify and classify new mail on that basis.

Third, rule-based filtering. Existing rule-based methods obtain explicit rules by training on a large number of samples and then use these rules to filter spam. Rule-based methods mainly include the Ripper method, decision-tree methods, and the Boosting method.

These spam filtering techniques have serious shortcomings. First, they lack collaboration: because each mail server filters spam independently, with no information exchange and no overall system, the servers cannot cooperate to counter spam on a large scale. Second, their computational complexity is high: obtaining filtering rules or spam features requires training on a large number of samples and extracting sample content, and the resulting filtering rules are themselves complex. Third, they lack timeliness: filtering rules and spam features have long update cycles and are updated slowly, so the newest spam cannot be filtered and the filters lag behind. Finally, they lack adaptability: spam on the network has taken new forms, including spam whose content is in non-text formats such as images. Existing content-based filtering methods analyze the text content of mail and therefore cannot filter spam delivered as images, which causes many missed detections and lowers filtering accuracy.

Summary of the invention

Technical problem: To address the shortcomings of conventional spam filtering techniques, the present invention proposes a new collaborative spam filtering method. The method is collaborative, fast-responding, and adaptive, and can intercept spam on a large scale in an Internet environment. The invention integrates the mail servers by introducing a directory server. Each mail server uses honeypot technology to collect samples and, after a first-level judgment, submits the feature vectors of the samples to the directory server. The directory server performs a second judgment on these feature vectors, generates filtering rules from the vectors that pass the screening, and then publishes the updated filtering rules to the other mail servers, thereby achieving collaborative spam prevention.

Technical scheme: The collaborative spam prevention method of the present invention is as follows:

In the first step, the mail server reads the relevant information of each e-mail account and scores each account from this information using the honeypot account scoring formula. It then writes the resulting scores to the system's account database in descending order, updating the account score table, and applies the honeypot selection algorithm to select a number of the system's e-mail accounts as honeypot accounts, which form the honeypot set;

In the second step, based on the behavioral characteristics of spam, mail samples are collected periodically from the honeypot account set and their features are extracted to form feature vectors, which are used to represent the sample set;

In the third step, the repetition count of each sample feature vector in the honeypot set, that is, the number of times the sample occurs, is used to make a first judgment (the first-level judgment), which gives an initial improvement in sample accuracy;

In the fourth step, the mail server submits the feature vectors that passed the first-level judgment to the directory server, which performs a second judgment (the second-level judgment) on them, selects the feature vectors with higher accuracy, and generates filtering rules;

In the fifth step, the directory server publishes the newly generated filtering rules to the rule base of each mail server, and each mail server uses the updated rules to filter spam when it receives new e-mail.

The honeypot set is generated as follows. The initial honeypot set is empty. The system reads the account score table from the database, preferentially selects the accounts with higher scores, and adds them to the honeypot set. After each honeypot account is added, the amount of spam in the set is counted. Because the amount of spam on a server is finite, the amount of spam collected from the set tends toward a fixed value as the set grows; when the increase in spam after adding a honeypot account falls below a set threshold, the final honeypot set is determined.
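The selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented algorithm itself: the names `select_honeypots`, `spam_by_account`, and `min_gain` are hypothetical, spam is identified by opaque message ids, and the account whose addition produces a below-threshold gain is assumed to stay in the final set.

```python
def select_honeypots(scored_accounts, spam_by_account, min_gain):
    """Greedily build a honeypot set from scored accounts.

    scored_accounts: list of (account, score) pairs (hypothetical input shape).
    spam_by_account: dict mapping account -> set of spam message ids seen there.
    min_gain: stop once adding an account yields fewer new spam ids than this.
    """
    # Prefer accounts with higher scores.
    ranked = sorted(scored_accounts, key=lambda pair: pair[1], reverse=True)
    honeypots = []
    collected = set()  # distinct spam ids gathered by the set so far
    for account, _score in ranked:
        before = len(collected)
        honeypots.append(account)
        collected |= spam_by_account.get(account, set())
        # Assumption: the set is final once the spam increment drops
        # below the threshold; the last account added is kept.
        if len(collected) - before < min_gain:
            break
    return honeypots
```

Counting distinct message ids rather than raw totals is a design choice made here so that the collected amount levels off as the text describes; the source does not say how the spam in the set is counted.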

When honeypot accounts are selected, candidate honeypots are scored with the formula:

V(t, \Delta t) = \left( \lambda_{1} \cdot \frac{S_{1}(t - \Delta t)}{H_{1}(t - \Delta t)} + (1 - \lambda_{1}) \cdot \frac{S_{2}(\Delta t)}{H_{2}(\Delta t)} \right) \cdot \lambda_{2} + S_{2}(\Delta t) \cdot (1 - \lambda_{2})

where:

V: the score of an account in the system, indicating how likely the account is to be chosen as a honeypot; the larger V is, the more likely the account is to be chosen as a honeypot account, and the smaller V is, the less likely;

t: time variable, the moment at which the algorithm is executed;

Δt: time-interval variable, the interval between two successive executions of the algorithm;

S₁(t−Δt): the historical total of spam the account had received as of time (t−Δt);

S₂(Δt): the total spam the account received in the most recent period Δt;

H₁(t−Δt): the historical total of legitimate mail the account had received as of time (t−Δt);

H₂(Δt): the total legitimate mail the account received in the most recent period Δt;

λ₁: a weight between 0 and 1, which can be tuned for the actual system;

λ₂: a weight between 0 and 1, which can be tuned for the actual system.
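As an illustration, the scoring formula can be evaluated directly. The sketch below is hypothetical: the function name `honeypot_score` is not from the source, and the guard against a zero legitimate-mail count is an assumption, since the source does not say how H₁ = 0 or H₂ = 0 is handled.

```python
def honeypot_score(s1, h1, s2, h2, lam1, lam2):
    """Evaluate V(t, dt) for one account.

    s1, h1: historical spam / legitimate totals as of time t - dt.
    s2, h2: spam / legitimate totals within the most recent interval dt.
    lam1, lam2: weights in [0, 1].
    """
    def ratio(spam, legit):
        # Assumption: treat a zero legitimate-mail count as 1 to avoid
        # division by zero; the source leaves this case unspecified.
        return spam / max(legit, 1)

    blended = lam1 * ratio(s1, h1) + (1 - lam1) * ratio(s2, h2)
    return blended * lam2 + s2 * (1 - lam2)
```

For example, with S₁ = 100, H₁ = 50, S₂ = 10, H₂ = 5 and both weights 0.5, both ratios equal 2, so V = 2 × 0.5 + 10 × 0.5 = 6.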

Mail sample collection relies on the bulk-sending behavior of spam: a single spam message often appears in several honeypot accounts at the same time. To exploit this, the distribution of each message across the honeypot set is counted, that is, the number of accounts in the set that received the message. If the number of accounts in the set that received the same message exceeds a specified threshold, the message is classed as "suspected" spam and collected;
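A minimal sketch of this counting step is given below. It assumes each message can be reduced to a comparable key (for instance a content fingerprint); the names `collect_suspected_spam`, `honeypot_inboxes`, and `min_accounts` are hypothetical.

```python
from collections import defaultdict


def collect_suspected_spam(honeypot_inboxes, min_accounts):
    """Return the message keys received by more than min_accounts honeypots.

    honeypot_inboxes: dict mapping honeypot account -> iterable of message keys.
    """
    receivers = defaultdict(set)  # message key -> accounts that received it
    for account, message_keys in honeypot_inboxes.items():
        for key in message_keys:
            receivers[key].add(account)
    # "Greater than the specified threshold", as stated in the text.
    return {key for key, accounts in receivers.items()
            if len(accounts) > min_accounts}
```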

Extracting sample features means performing feature extraction on the spam samples collected from the honeypot account set and representing each sample as a feature vector for later storage and computation. The extraction method works on the mail header and a fingerprint of the mail content, rather than on the mail content itself, and so produces a lightweight feature vector;

The feature vector of a sample has the following form:

F = <SA, SIP, FP>

The meaning of each component of the feature vector F is shown in the table below:

Component	Meaning
SA	Sender's e-mail address, i.e. the Return-Path field of the mail header
SIP	Source IP of the mail: the first IP address in the last Received field of the mail header
FP	Fingerprint of the mail content
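The three components can be pulled from a raw message with Python's standard `email` package, as in the hedged sketch below. Two points are assumptions rather than the patent's method: only IPv4 addresses are recognized, and a SHA-256 hash of the whitespace-normalized body stands in for the content fingerprint (the source uses an improved Nilsimsa digest, which is not reproduced here). The names `FeatureVector` and `extract_features` are hypothetical.

```python
import email
import hashlib
import re
from typing import NamedTuple, Optional


class FeatureVector(NamedTuple):
    sa: Optional[str]   # sender address from Return-Path
    sip: Optional[str]  # first IP in the last Received header
    fp: str             # content fingerprint (stand-in hash)


IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_features(raw_message: str) -> FeatureVector:
    msg = email.message_from_string(raw_message)

    return_path = msg.get("Return-Path")
    sa = return_path.strip().strip("<>") if return_path else None

    # get_all keeps header order, so the last entry is the last Received field.
    received = msg.get_all("Received") or []
    match = IPV4.search(received[-1]) if received else None
    sip = match.group(0) if match else None

    body = msg.get_payload()
    if not isinstance(body, str):  # multipart: fall back to the whole message
        body = msg.as_string()
    normalized = " ".join(body.split()).lower()
    # Assumption: SHA-256 is a placeholder, not the patent's Nilsimsa variant.
    fp = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return FeatureVector(sa, sip, fp)
```

Because only headers and a fixed-size digest are kept, the vector stays small regardless of message size, which is the "lightweight" property the text emphasizes.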

In the first-level judgment, the repetition count of each feature vector in the set is counted, and a feature vector is kept if its repetition count exceeds a preset threshold. Feature vectors that already exist in the feature library are then deleted from the set, the resulting feature vector set is written to the system's feature library to complete the update, and the mail server submits the feature vectors that passed the first-level judgment to the directory server;
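The counting, de-duplication, and library update just described can be sketched as below. The function and parameter names are hypothetical, feature vectors are assumed to be hashable tuples, and returning the repetition counts alongside the vectors is an assumption made so the result can be forwarded to the directory server.

```python
from collections import Counter


def first_level_judgment(sample_vectors, feature_library, min_repeats):
    """Keep vectors repeated more than min_repeats and not yet in the library.

    sample_vectors: list of hashable feature vectors, one per collected sample.
    feature_library: set of vectors already known locally; updated in place.
    Returns a dict mapping each newly accepted vector to its repetition count.
    """
    counts = Counter(sample_vectors)
    accepted = {
        vector: count
        for vector, count in counts.items()
        if count > min_repeats and vector not in feature_library
    }
    # Write the final vector set into the local feature library.
    feature_library.update(accepted)
    return accepted
```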

The computation for the first-level judgment is:

C = A_{m \times n} \times S = \begin{bmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{m} \end{bmatrix} \times S = \begin{bmatrix} R_{1} \times S \\ R_{2} \times S \\ \vdots \\ R_{m} \times S \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} r_{1i} \times s_{i} \\ \sum_{i=1}^{n} r_{2i} \times s_{i} \\ \vdots \\ \sum_{i=1}^{n} r_{mi} \times s_{i} \end{bmatrix} = \begin{bmatrix} c_{1} \\ c_{2} \\ \vdots \\ c_{m} \end{bmatrix}

where C is the confidence matrix; the directory server screens the features a second time according to the confidence matrix;

The mail server submits the generated vector set to the directory server through a unified interface. The directory server sets aside a dedicated buffer for storing pending feature vector sets. When the directory server receives a feature vector set from a mail server, it stores the set temporarily in this buffer; once the number of feature vector sets received reaches a certain number, it performs the second-level judgment on them.
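The buffering behavior can be illustrated with a small class. This is a sketch only: `SubmissionBuffer` and `batch_size` are hypothetical names, and the source does not specify the trigger count or the interface used for submission.

```python
class SubmissionBuffer:
    """Holds feature vector sets until enough have arrived to judge them."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # assumed trigger count
        self.pending = []             # (server_id, vector_counts) pairs

    def submit(self, server_id, vector_counts):
        """Store one submission; return the full batch when it is ready."""
        self.pending.append((server_id, vector_counts))
        if len(self.pending) >= self.batch_size:
            batch, self.pending = self.pending, []
            return batch  # caller runs the second-level judgment on this
        return None
```

Batching in this way lets the directory server compare repetition counts across several mail servers in one pass instead of judging each submission in isolation.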

The second-level judgment makes a joint decision based on the accuracy with which each mail server identifies spam and the repetition count of each feature vector on each mail server: it computes the confidence of each spam feature and eliminates the feature vectors with low confidence;

In the second-level judgment, the directory server combines the accuracy matrix of the mail servers with the repetition-count matrix of the feature vectors to generate the confidence matrix of the feature vectors. The accuracy matrix is:

S^{T} = \begin{bmatrix} s_{1} & s_{2} & \cdots & s_{n} \end{bmatrix}

where s_i is the accuracy with which server i identifies spam. The repetition-count matrix is:

A_{m \times n} = \begin{bmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{m} \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix}

In A_{m \times n}, m is the number of distinct feature vectors, n is the number of mail servers, R_p is the row of repetition counts for feature vector p, and r_{pq} is the repetition count of feature vector p on mail server q;
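The product C = A × S weights each server's repetition count by that server's accuracy. The following sketch computes it with plain lists; the function names and the `min_confidence` cut-off are hypothetical, since the source does not state how the threshold for "low confidence" is chosen.

```python
def confidence_scores(repeat_matrix, accuracies):
    """Compute C = A x S: one confidence value per feature vector.

    repeat_matrix: m rows of n repetition counts (row p, column q = r_pq).
    accuracies: n per-server accuracy values s_1..s_n.
    """
    return [
        sum(r * s for r, s in zip(row, accuracies))
        for row in repeat_matrix
    ]


def screen_vectors(vectors, repeat_matrix, accuracies, min_confidence):
    """Drop feature vectors whose confidence is below min_confidence."""
    scores = confidence_scores(repeat_matrix, accuracies)
    return [v for v, c in zip(vectors, scores) if c >= min_confidence]
```

For instance, with A = [[3, 0, 2], [1, 0, 0]] and S = [0.9, 0.8, 0.5], the confidences are 3 × 0.9 + 2 × 0.5 = 3.7 and 1 × 0.9 = 0.9, so a cut-off of 2.0 keeps only the first vector.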

The filtering rule described in the fourth step comprises two parts, the fingerprint information of the mail content and a blacklist, both of which can be extracted from the feature vectors.
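One plausible way to derive such a rule from screened feature vectors is sketched below. The dictionary layout and the function name `build_filter_rule` are hypothetical, and placing both the sender address and the source IP on the blacklist is an assumption; the source only says that the blacklist can be extracted from the feature vectors.

```python
def build_filter_rule(feature_vectors):
    """Split <SA, SIP, FP> vectors into a fingerprint set and a blacklist."""
    fingerprints = set()
    blacklist = set()
    for sa, sip, fp in feature_vectors:
        fingerprints.add(fp)
        # Assumption: both sender address and source IP are blacklisted.
        if sa:
            blacklist.add(sa)
        if sip:
            blacklist.add(sip)
    return {"fingerprints": fingerprints, "blacklist": blacklist}
```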

The update process in the fifth step reads the updated filtering rules from the rule base and publishes them to each mail server, so that the mail servers share and update their filtering rules and achieve collaborative spam prevention.

When new e-mail is filtered in the fifth step, the feature vector of the message is extracted first. The system buffer is then searched for a matching feature vector; if one exists, the message is judged to be spam. Otherwise, the sending host information of the message is looked up in the blacklist, and if it matches the blacklist the message is judged to be spam. When no matching fingerprint or blacklist entry is found, the system places the message in a mail queue, subject to a preset maximum time that a message may remain in the queue, and re-evaluates it by the same procedure at fixed intervals. If no matching filtering rule has appeared in the system by the end of the maximum residence time, the message is judged to be legitimate and delivered to the corresponding account.
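The decision made on each pass through this flow can be written as a small pure function, sketched here under assumptions: the rule is a dictionary with `fingerprints` and `blacklist` sets (a hypothetical layout), fingerprints are matched exactly rather than by similarity, and the caller is responsible for re-queuing the message and tracking how long it has waited.

```python
def classify(feature, rule, waited_seconds, max_wait_seconds):
    """Return "spam", "requeue", or "legitimate" for one evaluation pass.

    feature: (sa, sip, fp) tuple extracted from the message.
    rule: dict with "fingerprints" and "blacklist" sets (assumed layout).
    waited_seconds: time the message has already spent in the mail queue.
    """
    sa, sip, fp = feature
    if fp in rule["fingerprints"]:
        return "spam"
    # Assumption: sending host information is checked via address and IP.
    if sa in rule["blacklist"] or sip in rule["blacklist"]:
        return "spam"
    if waited_seconds < max_wait_seconds:
        return "requeue"  # evaluate again after the fixed interval
    return "legitimate"   # no rule appeared within the maximum residence time
```

Holding unmatched mail for a bounded time gives newly published rules a chance to arrive before delivery, at the cost of a delivery delay of at most `max_wait_seconds` for legitimate mail.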

Beneficial effect: The invention is characterized by introducing a directory server into a distributed environment to integrate the mail servers, and by a two-level judgment mechanism that judges and screens "suspected spam" samples, which improves rule accuracy. The invention is collaborative, fast-responding, and adaptive, and can intercept spam on a large scale in an Internet environment. According to test results, the invention is at an internationally leading level in cooperative spam prevention.

Description of drawings

Fig. 1 is the honeypot account selection flow chart of the present invention;

Fig. 2 is a schematic diagram of the collaborative spam prevention system of the present invention;

Fig. 3 is the spam filtering flow chart of the present invention.

Embodiment

The method of the present invention is further described as follows:

A. The mail server reads the relevant information of each e-mail account, scores each account from this information using the honeypot account scoring formula, and then writes the resulting scores to the system's account database in descending order, updating the account score table;

B. Determine the honeypot set: in this stage the initial honeypot set is empty. The system reads the account score table from the database, preferentially selects the accounts with higher scores, and adds them to the honeypot set. After each honeypot account is added, the amount of spam in the set is counted. Because the amount of spam on a server is finite, the amount of spam collected from the set tends toward a fixed value as the set grows; when the increase in spam after adding a honeypot account falls below a set threshold, the final honeypot set is determined;

C. Sample collection: because spam is sent in bulk, a single spam message often appears in several honeypot accounts at the same time. To exploit this, the distribution of each message across the honeypot set is counted, that is, the number of accounts in the set that received the message. If the number of accounts in the set that received the same message exceeds a specified threshold, the message is classed as "suspected" spam and collected;

D. Feature extraction: features are extracted from the spam samples collected from the honeypot account set, and each sample is represented as a feature vector for later storage and computation. The extraction method works on the mail header and a fingerprint of the mail content, rather than on the mail content itself, and so produces a lightweight feature vector;

E. First-level judgment: the repetition count of each feature vector in the set is counted, and a feature vector is kept if its repetition count exceeds a preset threshold. Feature vectors that already exist in the feature library are then deleted from the set, the resulting feature vector set is written to the system's feature library to complete the update, and the mail server submits the feature vectors that passed the first-level judgment to the directory server;

F. Second-level judgment: a joint decision is made based on the accuracy with which each mail server identifies spam and the repetition count of each feature vector on each mail server; the confidence of each spam feature is computed and the feature vectors with low confidence are eliminated;

G. Filtering rules are generated and distributed to each mail server, and each mail server uses the updated filtering rules to filter spam.

V(t, \Delta t) = \left( \lambda_{1} \cdot \frac{S_{1}(t - \Delta t)}{H_{1}(t - \Delta t)} + (1 - \lambda_{1}) \cdot \frac{S_{2}(\Delta t)}{H_{2}(\Delta t)} \right) \cdot \lambda_{2} + S_{2}(\Delta t) \cdot (1 - \lambda_{2})

where the symbols have the meanings defined for this formula in the summary of the invention.

The feature vector of a sample has the following form:

F = <SA, SIP, FP>

The meaning of each component of the feature vector F is shown in the table below:

Component	Meaning
SA	Sender's e-mail address, i.e. the Return-Path field of the mail header
SIP	Source IP of the mail: the first IP address in the last Received field of the mail header
FP	Fingerprint of the mail content

To obtain the fingerprint of the mail content, i.e. the FP component of the feature vector, an improved version of the open-source Nilsimsa digest method is used. Nilsimsa is essentially a hash algorithm, and it has a significant advantage in computing the similarity of mail.
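The similarity advantage comes from the fact that similar messages tend to produce digests that differ in only a few bits. The sketch below compares two digests in the way Nilsimsa comparisons are commonly described: a 256-bit digest written as 64 hexadecimal characters, scored as 128 minus the number of differing bits. It does not compute the digest itself, the patent's improved variant is not specified in the source, and the function name is hypothetical.

```python
def nilsimsa_style_score(hex_a, hex_b):
    """Score two 256-bit hex digests from -128 (opposite) to 128 (identical)."""
    if len(hex_a) != 64 or len(hex_b) != 64:
        raise ValueError("expected 64 hexadecimal characters per digest")
    # XOR leaves a 1 bit wherever the two digests disagree.
    differing_bits = bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")
    return 128 - differing_bits
```

A filter built on this kind of score would treat two messages as the same spam when the score exceeds some threshold, which lets it tolerate small edits that would defeat an exact-match hash.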

S^{T} = \begin{bmatrix} s_{1} & s_{2} & \cdots & s_{n} \end{bmatrix}

A_{m \times n} = \begin{bmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{m} \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix}

In A_{m \times n}, m is the number of distinct feature vectors, n is the number of mail servers, R_p is the row of repetition counts for feature vector p, and r_{pq} is the repetition count of feature vector p on mail server q. The computation for the first-level judgment is:

C = A_{m \times n} \times S = \begin{bmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{m} \end{bmatrix} \times S = \begin{bmatrix} R_{1} \times S \\ R_{2} \times S \\ \vdots \\ R_{m} \times S \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} r_{1i} \times s_{i} \\ \sum_{i=1}^{n} r_{2i} \times s_{i} \\ \vdots \\ \sum_{i=1}^{n} r_{mi} \times s_{i} \end{bmatrix} = \begin{bmatrix} c_{1} \\ c_{2} \\ \vdots \\ c_{m} \end{bmatrix}

where C is the confidence matrix; the directory server screens the features a second time according to the confidence matrix.

A filtering rule comprises two parts, the fingerprint information of the mail content and a blacklist, both of which can be extracted from the feature vectors.

The updated filtering rules are read from the rule base and published to each mail server, so that the mail servers share and update their filtering rules and achieve collaborative spam prevention.

When a mail server receives a new e-mail, it first extracts the feature vector of the message. The system buffer is then searched for a matching feature vector; if one exists, the message is judged to be spam. Otherwise, the sending host information of the message is looked up in the blacklist, and if it matches the blacklist the message is judged to be spam. When no matching fingerprint or blacklist entry is found, the system places the message in a mail queue, subject to a preset maximum time that a message may remain in the queue, and re-evaluates it by the same procedure at fixed intervals. If no matching filtering rule has appeared in the system by the end of the maximum residence time, the message is judged to be legitimate and delivered to the corresponding account.

As shown in Fig. 2, deploying the invention requires building a network made up of a directory server and a number of mail servers. The main modules of a mail server and their functions are as follows:

(1) Sample collection module. It consists of three submodules: sample collection, feature extraction, and first-level judgment. The sample collection submodule compiles the historical information of the mail accounts, selects a number of e-mail accounts on the server as honeypot accounts, and uses the bulk-sending characteristic of spam to collect suspected spam samples periodically from the honeypot account set. The feature extraction submodule generates lightweight feature vectors, mainly by extracting the header information and content fingerprint of suspected spam (rather than analyzing the mail content itself). The first-level judgment submodule performs the first judgment and screening of the extracted features according to the repetition count of each sample in the honeypot account set, and then submits them to the directory server.

(2) Rule update module. It receives the latest filtering rules published by the directory server and stores them in the local rule base.

(3) Spam filtering module. It takes a message from the mail buffer queue, extracts its feature vector, and searches the local rule base for a matching rule. If a match is found, the message is judged to be spam; otherwise a cache time is set and the message is stored in the user buffer. If no matching rule has appeared by the end of the cache time, the message is judged to be legitimate.

The main modules of the directory server and their functions are as follows:

(1) Second-level judgment module. It performs a second judgment on the feature vectors according to the number of times each was submitted by the mail servers (the repetition count) and the accuracy with which each mail server judges spam, and selects the set of high-accuracy feature vectors.

(2) Rule generation module. It restructures the feature vector set, generates filtering rules, and stores them in the directory server's rule base.

(3) Rule publishing module. At a set interval, it finds the updated rules in the rule base and publishes them to each mail server, so that the mail servers' local rule bases are updated quickly and in near real time.

A prototype system has been developed based on the invention, comprising each of the functional modules described above. In terms of results, the invention can improve the accuracy of spam filtering while intercepting spam on a large scale; the system also responds quickly to new types of spam and adapts to different types of spam, for example filtering spam whose content is a Web page or an image.

Claims

1. A collaborative spam prevention method, characterized in that the method is as follows:

In the fifth step, the directory server publishes the newly generated filtering rules to the rule base of each mail server, and each mail server uses the updated rules to filter spam when it receives new e-mail;

where:

λ₁: a weight between 0 and 1, tuned for the actual system;

λ₂: a weight between 0 and 1, tuned for the actual system.

2. The collaborative spam prevention method according to claim 1, characterized in that: mail sample collection relies on the bulk-sending behavior of spam, in which a single spam message often appears in several honeypot accounts at the same time; to exploit this, the distribution of each message across the honeypot set is counted, that is, the number of accounts in the set that received the message; if the number of accounts in the set that received the same message exceeds a specified threshold, the message is classed as "suspected" spam and collected.

3. The collaborative spam prevention method according to claim 1, characterized in that: extracting sample features means performing feature extraction on the spam samples collected from the honeypot account set and representing each sample as a feature vector for later storage and computation; the extraction method works on the mail header and a fingerprint of the mail content, rather than on the mail content itself, and so produces a lightweight feature vector;

The feature vector of the sample has the following form:

F = <SA, SIP, FP>

4. The collaborative spam prevention method according to claim 1, characterized in that: in the first-level judgment, the repetition count of each feature vector in the set is counted, and a feature vector is kept if its repetition count exceeds a preset threshold; feature vectors that already exist in the feature library are then deleted from the set, the resulting feature vector set is written to the system's feature library to complete the update, and the mail server submits the feature vectors that passed the first-level judgment to the directory server;

The computation for the first-level judgment is:

The mail server submits the generated vector set to the directory server through a unified interface; the directory server sets aside a dedicated buffer for storing pending feature vector sets; when the directory server receives a feature vector set from a mail server, it stores the set temporarily in this buffer, and once the number of feature vector sets received reaches a certain number, it performs the second-level judgment on them;

where m is the number of distinct feature vectors, n is the number of mail servers, R_p is the row of repetition counts for feature vector p, r_{pq} is the repetition count of feature vector p on mail server q, and S is the accuracy matrix.

5. The collaborative spam prevention method according to claim 1, characterized in that: the second-level judgment makes a joint decision based on the accuracy with which each mail server identifies spam and the repetition count of each feature vector on each mail server, computes the confidence of each spam feature, and eliminates the feature vectors with low confidence;

In the second-level judgment, the directory server combines the accuracy matrix of the mail servers with the repetition-count matrix of the feature vectors to generate the confidence matrix of the feature vectors, where the accuracy matrix is:

S^{T} = \begin{bmatrix} s_{1} & s_{2} & \cdots & s_{n} \end{bmatrix}

s_i is the accuracy with which server i identifies spam, and the repetition-count matrix is:

In A_{m \times n}, m is the number of distinct feature vectors, n is the number of mail servers, R_p is the row of repetition counts for feature vector p, and r_{pq} is the repetition count of feature vector p on mail server q.

6. The collaborative spam prevention method according to claim 1, characterized in that: the filtering rule described in the fourth step comprises two parts, the fingerprint information of the mail content and a blacklist, both of which can be extracted from the feature vectors.

7. The collaborative spam prevention method according to claim 1, characterized in that: the update process in the fifth step reads the updated filtering rules from the rule base and publishes them to each mail server, so that the mail servers share and update their filtering rules and achieve collaborative spam prevention.

8. The collaborative spam prevention method according to claim 1, characterized in that: when new e-mail is filtered in the fifth step, the feature vector of the message is extracted first; the system buffer is then searched for a matching feature vector, and if one exists the message is judged to be spam; otherwise, the sending host information of the message is looked up in the blacklist, and if it matches the blacklist the message is judged to be spam; when no matching fingerprint or blacklist entry is found, the system places the message in a mail queue, subject to a preset maximum time that a message may remain in the queue, and re-evaluates it by the same procedure at fixed intervals; if no matching filtering rule has appeared in the system by the end of the maximum residence time, the message is judged to be legitimate and delivered to the corresponding account.