CN101494546B - Method for preventing collaboration type junk mail - Google Patents

Publication number
CN101494546B
Authority
CN
China
Prior art keywords
mail
spam
account
characteristic vector
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100286953A
Other languages
Chinese (zh)
Other versions
CN101494546A (en)
Inventor
曹玖新
罗军舟
林加镇
姚燚
刘永生
孙学胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN2009100286953A
Publication of CN101494546A
Application granted
Publication of CN101494546B
Current legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A cooperative spam prevention method addresses the e-mail security problem on the Internet and draws on related technologies such as the honeypot principle and spam filtering. The method uses a distributed network structure comprising e-mail servers and a directory server. In this structure, each e-mail server is the main body that filters spam and also collects spam samples. To improve the timeliness of sample collection, the e-mail server extracts the features of the collected samples, carries out an initial judgment (first-level judgment), and then submits the 'possible spam' features to the directory server. The directory server carries out a second filtering (second-level judgment) on the received 'possible spam' features, generates spam filtering rules and stores them in a rule base, and then releases the updated rules to the local rule base of each e-mail server through a unified interface, thus sharing and updating the filtering rules and achieving cooperative spam prevention.

Description

Method for preventing collaboration type junk mail
Technical field
The present invention is a method that uses related technologies such as the honeypot principle and spam filtering. It belongs to the field of network security, in particular e-mail security.
Background
In recent years, the propagation modes and content of spam have changed, and the harm it causes has grown increasingly serious: it consumes large amounts of network resources and reduces network operating efficiency; it wastes a great deal of mail users' time; it has become a main channel for spreading viruses, Trojan horses and phishing, seriously threatening network security; in addition, spam spreads harmful information and has a serious negative effect on society. Existing spam filtering technology, however, cannot cope well with this situation. Further improving the Internet's ability to resist spam, and better meeting users' anti-spam needs, has become an urgent task in the field of network security.
The spam filtering technologies in general use at present are mainly:
1. Filtering based on black and white lists. This technique requires users to maintain black and white lists manually, and filters spam on that basis.
2. Filtering based on statistics. Existing statistical techniques, for example the Support Vector Machine (SVM) method and the Bayes method, learn and summarize the statistical regularities of samples, and on this basis identify and classify new mail.
3. Filtering based on rules. Existing rule-based methods obtain explicit rules by training on a large number of samples, and then use these rules to filter spam. Rule-based methods mainly include the Ripper method, the decision tree method and the Boosting method.
The spam filtering technologies above have major shortcomings. First, they lack collaboration: each mail server filters spam independently, there is no information exchange and no overall system, so the servers cannot cooperate to prevent spam on a large scale. Second, their computational complexity is high: obtaining filtering rules or spam features requires training on large numbers of samples and extracting sample content, and the filtering rules themselves are complex. Third, they lack timeliness: filtering rules or spam features have a long update cycle and are updated slowly, so the newest spam cannot be filtered and there is a lag. Finally, they lack adaptability: the form of spam on the network has changed, and spam whose content is in non-text formats such as pictures has appeared. Existing content-based filtering methods analyse the text content of mail and therefore cannot filter spam sent as pictures, which causes a large number of missed detections and lowers filtering accuracy.
Summary of the invention
Technical problem: To address the shortcomings of traditional spam filtering technology, the present invention proposes a new collaborative spam filtering method. The method is collaborative, offers rapid immunity and is adaptive, and can prevent spam on a large scale in an Internet environment. The invention integrates the mail servers by introducing a directory server. Each mail server uses honeypot technology to collect samples and, after a first-level judgment, submits the feature vectors of the samples to the directory server. The directory server carries out a second judgment on these feature vectors, generates filtering rules after screening, and then publishes the updated filtering rules to the other mail servers, achieving collaborative spam prevention.
Technical scheme: The collaborative spam prevention method of the present invention is as follows:
Step 1: The mail server reads the relevant information of the e-mail accounts and scores each account by applying the honeypot account scoring formula to this information. It then writes the resulting scores, in descending order, to the system's account database, updating the account score table, and, using the honeypot selection algorithm, selects a number of the system's e-mail accounts as honeypot accounts to form the honeypot set;
Step 2: Based on the behavioural characteristics of spam, mail samples are collected periodically from the honeypot account set and their features are extracted to form feature vectors, which are used to represent the sample set;
Step 3: The repetition count of each sample feature vector in the honeypot set, i.e. the number of times the sample occurs, is used to make a first judgment (the first-level judgment), which gives an initial improvement in sample accuracy;
Step 4: The mail server submits the feature vectors that pass the first-level judgment to the directory server. The directory server makes a second judgment on them (the second-level judgment), screens out the feature vectors with higher accuracy, and generates filtering rules;
Step 5: The directory server publishes the newly generated filtering rules to the rule base of each mail server, which is updated accordingly. Each mail server uses the updated rules to filter spam when it receives new e-mail.
The honeypot set is generated as follows. The initial honeypot set is empty. The system then reads the account score table from the database, preferentially selects the accounts with higher scores and adds them to the honeypot set. Each time a honeypot account is added, the amount of spam in the set is counted. Because the amount of spam on a server is limited, the amount of spam collected from the set tends towards a fixed value as the set grows. When the increase in spam after adding a honeypot account falls below a set threshold, the final honeypot set is determined;
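The selection loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name, the `spam_count` callback and the choice to keep the last account added are all hypothetical.

```python
def select_honeypots(ranked_accounts, spam_count, min_gain):
    """Grow the honeypot set greedily until the spam gain levels off.

    ranked_accounts: account ids sorted by score, highest first.
    spam_count:      hypothetical callback that returns how many distinct
                     spam messages a given set of accounts has collected.
    min_gain:        the saturation threshold mentioned in the text.
    """
    honeypots = set()            # the initial honeypot set is empty
    collected = 0
    for account in ranked_accounts:
        honeypots.add(account)
        now = spam_count(honeypots)
        gain = now - collected   # extra spam contributed by this account
        collected = now
        if gain < min_gain:      # spam volume has stopped growing: finish
            break
    return honeypots
```

Whether the account that triggers the stop condition stays in the final set is not specified in the text; the sketch keeps it.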
When selecting honeypot accounts, the scoring formula for candidate honeypots is:
$$V(t,\Delta t) = \left( \lambda_1 \cdot \frac{S_1(t-\Delta t)}{H_1(t-\Delta t)} + (1-\lambda_1) \cdot \frac{S_2(\Delta t)}{H_2(\Delta t)} \right) \cdot \lambda_2 + S_2(\Delta t) \cdot (1-\lambda_2)$$
where:
V: the score of an account in the system, indicating how likely the account is to be chosen as a honeypot; the larger V is, the more likely the account is to be chosen as a honeypot account, and vice versa;
t: time variable, the moment at which the algorithm is executed;
Δt: time interval variable, the interval between two successive executions of the algorithm;
S_1(t - Δt): the historical total of spam the account had received up to time (t - Δt);
S_2(Δt): the total spam the account received in the most recent period Δt;
H_1(t - Δt): the historical total of legitimate mail the account had received up to time (t - Δt);
H_2(Δt): the total legitimate mail the account received in the most recent period Δt;
λ_1: a weight with a value between 0 and 1, which can be adjusted according to the actual system;
λ_2: a weight with a value between 0 and 1, which can be adjusted according to the actual system.
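As a worked illustration, the score can be computed as below, reading the S and H terms in the formula as spam-to-legitimate ratios. The function name, the default weights and the handling of a zero legitimate-mail count are assumptions; the text does not say how a zero denominator is treated.

```python
def honeypot_score(s1, h1, s2, h2, lam1=0.5, lam2=0.5):
    """Score V(t, dt) for one account.

    s1, h1: historical spam / legitimate totals up to time t - dt
    s2, h2: spam / legitimate totals in the most recent interval dt
    lam1, lam2: weights between 0 and 1 (default values are arbitrary)
    """
    # Assumption: treat a zero legitimate-mail count as 1 so the ratio is defined.
    hist_ratio = s1 / max(h1, 1)
    recent_ratio = s2 / max(h2, 1)
    blended = lam1 * hist_ratio + (1 - lam1) * recent_ratio
    return blended * lam2 + s2 * (1 - lam2)
```

For example, an account with 100 historical spam and 50 historical legitimate messages, plus 10 spam and 5 legitimate messages in the latest interval, scores 6.0 with both weights at 0.5.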
Mail sample collection exploits the bulk-sending behaviour of spam: one spam message often appears in several honeypot accounts at the same time. Using this feature for sample collection requires counting the distribution of a message across the honeypot set, i.e. the number of accounts in the set that received the message. If the number of accounts in the set that received the same message exceeds a specified threshold, the message is classified as "suspected" spam and collected;
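A minimal sketch of this counting step follows. It assumes each message has already been reduced to a key that identifies "the same mail" (for instance a content fingerprint); that key and the function name are hypothetical.

```python
from collections import defaultdict

def collect_suspected_spam(honeypot_inboxes, min_accounts):
    """Return the message keys received by more than min_accounts honeypots.

    honeypot_inboxes: {account id: iterable of message keys}
    min_accounts:     the threshold on the number of receiving accounts
    """
    receivers = defaultdict(set)
    for account, messages in honeypot_inboxes.items():
        for key in messages:
            receivers[key].add(account)   # a set, so each account counts once
    return {key for key, accts in receivers.items() if len(accts) > min_accounts}
```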
Extracting sample features means performing feature extraction on the spam samples collected from the honeypot account set and representing each sample as a feature vector for subsequent storage and computation. The extraction works on the mail header and a fingerprint of the mail content, rather than on the mail content itself, and so produces a lightweight feature vector;
The feature vector of a sample has the following form:
F = <SA, SIP, FP>
The meaning of each component of the feature vector F is shown in the table below:
Component   Meaning
SA          Sender's e-mail address, i.e. the Return-Path part of the mail header
SIP         Source IP of the mail, i.e. the first IP address in the last Received field of the mail header
FP          Fingerprint of the mail content
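The three components could be pulled from a raw message roughly as follows, using Python's standard `email` package. This is a sketch under assumptions: it handles single-part text mail and IPv4 addresses only, and it uses SHA-256 as a stand-in for the content fingerprint, which is not the digest the patent describes.

```python
import hashlib
import re
from email import message_from_string
from email.utils import parseaddr

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_feature_vector(raw_mail):
    """Return (SA, SIP, FP) for a raw single-part text message."""
    msg = message_from_string(raw_mail)
    # SA: the address carried in the Return-Path header
    sa = parseaddr(msg.get("Return-Path", ""))[1]
    # SIP: first IP address in the last Received header (the hop nearest the sender)
    received = msg.get_all("Received") or []
    match = IPV4.search(received[-1]) if received else None
    sip = match.group(0) if match else ""
    # FP: stand-in fingerprint (assumption); the source uses a Nilsimsa variant
    body = msg.get_payload()
    fp = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return (sa, sip, fp)
```

An exact hash such as SHA-256 only matches identical bodies; a locality-sensitive digest would be needed to catch near-duplicates.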
The first-level judgment counts the repetitions of each feature vector in the set and keeps a feature vector if its repetition count exceeds a preset threshold. Feature vectors already present in the feature library are deleted from the set at the same time; the resulting feature vector set is then written to the system's feature library, completing the update, and the mail server submits the feature vectors that pass the first-level judgment to the directory server;
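The first-level judgment can be sketched as a count-and-filter step. The function name and the in-memory set standing in for the feature library are assumptions made for illustration.

```python
from collections import Counter

def first_level_judgment(feature_vectors, feature_library, min_repeats):
    """Return the new feature vectors to submit to the directory server.

    feature_vectors: feature tuples extracted from the honeypot set
                     (repeats are expected and are what gets counted)
    feature_library: set of vectors already known locally; updated in place
    min_repeats:     preset repetition threshold
    """
    counts = Counter(feature_vectors)
    kept = {fv for fv, n in counts.items() if n > min_repeats}
    new = kept - feature_library      # drop vectors the library already holds
    feature_library |= new            # write the new vectors to the library
    return new
```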
The computation for the second-level judgment is:
$$C = A_{m \times n} \times S = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_m \end{bmatrix} \times S = \begin{bmatrix} R_1 \times S \\ R_2 \times S \\ \vdots \\ R_m \times S \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} r_{1i} \times s_i \\ \sum_{i=1}^{n} r_{2i} \times s_i \\ \vdots \\ \sum_{i=1}^{n} r_{mi} \times s_i \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix}$$
where C is the confidence matrix, according to which the directory server carries out the second screening of the features;
The mail server submits the generated feature vector sets to the directory server through a unified interface. The directory server sets aside a dedicated buffer for storing pending feature vector sets. When the directory server receives a feature vector set from a mail server, it places the set in the buffer temporarily; once the number of feature vector sets received reaches a certain number, the second-level judgment is carried out on them.
The second-level judgment is a joint judgment based on the accuracy with which each mail server identifies spam and on the repetition count of each feature vector on each mail server. It calculates the confidence of each spam feature and eliminates the feature vectors with lower confidence;
In the second-level judgment, the directory server combines the accuracy matrix of the mail servers with the repetition count matrix of the feature vectors to produce the confidence matrix of the feature vectors. The accuracy matrix is:
$$S^{T} = \begin{bmatrix} s_1 & s_2 & \cdots & s_n \end{bmatrix}$$
where s_i is the accuracy with which server i identifies spam. The repetition count matrix is:
$$A_{m \times n} = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_m \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix}$$
In A_{m×n}, m is the number of distinct feature vectors, n is the number of mail servers, R_p is the repetition count row of feature vector p, and r_{pq} is the repetition count of feature vector p on mail server q;
The filtering rules in Step 4 consist of two parts, the fingerprints of mail content and a blacklist, both of which can be extracted from the feature vectors.
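One way to picture the rule structure is the sketch below. It assumes the blacklist is built from the SA and SIP components, which the text implies but does not state; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FilterRules:
    fingerprints: set = field(default_factory=set)  # FP components
    blacklist: set = field(default_factory=set)     # sender addresses and source IPs

def build_rules(feature_vectors):
    """Derive fingerprint and blacklist rules from (SA, SIP, FP) tuples."""
    rules = FilterRules()
    for sa, sip, fp in feature_vectors:
        rules.fingerprints.add(fp)
        # Assumption: both the sender address and the source IP are blacklisted.
        if sa:
            rules.blacklist.add(sa)
        if sip:
            rules.blacklist.add(sip)
    return rules
```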
The update in Step 5 reads the updated filtering rules from the rule base and publishes them to each mail server, so that the mail servers share and update their filtering rules and collaborative spam prevention is achieved.
When new e-mail is filtered in Step 5, the feature vector of the message is extracted first. The system buffer is then searched for a matching feature vector; if one exists, the message is judged to be spam. Otherwise, the sending host information of the message is checked against the blacklist, and if it matches, the message is judged to be spam. If neither a matching fingerprint nor a blacklist entry is found, the system places the message in the mail queue, subject to a preset maximum time that a message may stay in the queue, and re-judges it at fixed intervals using the same procedure. If no matching filtering rule has appeared in the system within the maximum retention time, the message is judged to be legitimate and delivered to the corresponding account.
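The filtering flow, including the retention queue, can be sketched as a small decision function plus a retry loop. All names are hypothetical, and `get_rules` is an assumed callback that returns the current fingerprint set and blacklist, which may change between retries as new rules arrive.

```python
import time

def classify(feature, fingerprints, blacklist, waited, max_wait):
    """Return 'spam', 'legitimate' or 'retry' for one (SA, SIP, FP) tuple."""
    sa, sip, fp = feature
    if fp in fingerprints:                   # fingerprint rule matches
        return "spam"
    if sa in blacklist or sip in blacklist:  # sender is blacklisted
        return "spam"
    if waited >= max_wait:                   # retention time used up
        return "legitimate"
    return "retry"                           # keep the mail queued for now

def filter_mail(feature, get_rules, max_wait, interval, sleep=time.sleep):
    """Re-judge a queued mail every `interval` seconds for up to `max_wait`."""
    waited = 0.0
    while True:
        fingerprints, blacklist = get_rules()
        verdict = classify(feature, fingerprints, blacklist, waited, max_wait)
        if verdict != "retry":
            return verdict
        sleep(interval)
        waited += interval
```

Holding unmatched mail for a bounded time trades delivery latency for the chance that a rule for a fresh spam campaign arrives before delivery.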
Beneficial effects: The present invention introduces a directory server into a distributed environment to integrate the mail servers, and uses a two-level judgment mechanism to judge and screen "suspected spam" samples, which improves the accuracy of the rules. The invention is collaborative, offers rapid immunity and is adaptive, and can prevent spam on a large scale in an Internet environment. According to test results, the invention is at an internationally leading level in collaborative spam prevention.
Description of drawings
Fig. 1 is the honeypot account selection flow chart of the present invention;
Fig. 2 is a schematic diagram of the collaborative spam prevention system of the present invention;
Fig. 3 is the spam filtering flow chart of the present invention.
Embodiment
The method of the present invention is further described as follows:
A. The mail server reads the relevant information of the e-mail accounts and scores each account by applying the honeypot account scoring formula to this information. It then writes the resulting scores, in descending order, to the system's account database, updating the account score table;
B. Determining the honeypot set: in this stage, the initial honeypot set is empty. The system then reads the account score table from the database, preferentially selects the accounts with higher scores and adds them to the honeypot set. Each time a honeypot account is added, the amount of spam in the set is counted. Because the amount of spam on a server is limited, the amount of spam collected from the set tends towards a fixed value as the set grows. When the increase in spam after adding a honeypot account falls below a set threshold, the final honeypot set is determined;
C. Sample collection: because spam is sent in bulk, one spam message often appears in several honeypot accounts at the same time. Using this feature for sample collection requires counting the distribution of a message across the honeypot set, i.e. the number of accounts in the set that received the message. If the number of accounts in the set that received the same message exceeds a specified threshold, the message is classified as "suspected" spam and collected;
D. Feature extraction: features are extracted from the spam samples collected from the honeypot account set, and each sample is represented as a feature vector for subsequent storage and computation. The extraction works on the mail header and a fingerprint of the mail content, rather than on the mail content itself, and so produces a lightweight feature vector;
E. First-level judgment: the repetitions of each feature vector in the set are counted, and a feature vector is kept if its repetition count exceeds a preset threshold. Feature vectors already present in the feature library are deleted from the set at the same time; the resulting feature vector set is then written to the system's feature library, completing the update, and the mail server submits the feature vectors that pass the first-level judgment to the directory server;
F. Second-level judgment: a joint judgment is made based on the accuracy with which each mail server identifies spam and on the repetition count of each feature vector on each mail server. The confidence of each spam feature is calculated and the feature vectors with lower confidence are eliminated;
G. Filtering rules are generated and distributed to each mail server, and each mail server uses the updated filtering rules to filter spam.
When selecting honeypot accounts, the scoring formula for candidate honeypots is:
$$V(t,\Delta t) = \left( \lambda_1 \cdot \frac{S_1(t-\Delta t)}{H_1(t-\Delta t)} + (1-\lambda_1) \cdot \frac{S_2(\Delta t)}{H_2(\Delta t)} \right) \cdot \lambda_2 + S_2(\Delta t) \cdot (1-\lambda_2)$$
where:
V: the score of an account in the system, indicating how likely the account is to be chosen as a honeypot; the larger V is, the more likely the account is to be chosen as a honeypot account, and vice versa;
t: time variable, the moment at which the algorithm is executed;
Δt: time interval variable, the interval between two successive executions of the algorithm;
S_1(t - Δt): the historical total of spam the account had received up to time (t - Δt);
S_2(Δt): the total spam the account received in the most recent period Δt;
H_1(t - Δt): the historical total of legitimate mail the account had received up to time (t - Δt);
H_2(Δt): the total legitimate mail the account received in the most recent period Δt;
λ_1: a weight with a value between 0 and 1, which can be adjusted according to the actual system;
λ_2: a weight with a value between 0 and 1, which can be adjusted according to the actual system.
The feature vector of a sample has the following form:
F = <SA, SIP, FP>
The meaning of each component of the feature vector F is shown in the table below:
Component   Meaning
SA          Sender's e-mail address, i.e. the Return-Path part of the mail header
SIP         Source IP of the mail, i.e. the first IP address in the last Received field of the mail header
FP          Fingerprint of the mail content
To obtain the fingerprint of the mail content, i.e. the FP component of the feature vector, this work uses an improved version of the open-source Nilsimsa digest method. Nilsimsa is in effect a hash algorithm that is well suited to computing the similarity of mail messages.
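The sketch below illustrates only the comparison side of such a digest: two equal-length hexadecimal digests are scored by how many bits agree. It does not implement Nilsimsa or the improved version mentioned above, and the function names and threshold parameter are assumptions.

```python
def matching_bits(digest_a, digest_b):
    """Count the bit positions at which two equal-length hex digests agree."""
    if len(digest_a) != len(digest_b):
        raise ValueError("digests must have the same length")
    total_bits = 4 * len(digest_a)   # each hex character encodes 4 bits
    differing = bin(int(digest_a, 16) ^ int(digest_b, 16)).count("1")
    return total_bits - differing

def is_near_duplicate(digest_a, digest_b, min_matching_bits):
    """Treat two messages as the same spam if enough digest bits agree."""
    return matching_bits(digest_a, digest_b) >= min_matching_bits
```

With a locality-sensitive digest, small edits to a message change only a few bits, so a high count of matching bits suggests near-identical content.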
The mail server submits the generated feature vector sets to the directory server through a unified interface. The directory server sets aside a dedicated buffer for storing pending feature vector sets. When the directory server receives a feature vector set from a mail server, it places the set in the buffer temporarily; once the number of feature vector sets received reaches a certain number, the second-level judgment is carried out on them.
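The buffering behaviour might look like the following sketch. The class name, the `batch_size` parameter and the return convention are assumptions; the text only says that judgment starts once "a certain number" of sets has been received.

```python
class SubmissionBuffer:
    """Hold submitted feature vector sets until a full batch is available."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._pending = []   # list of (server_id, feature_vectors) pairs

    def submit(self, server_id, feature_vectors):
        """Store one submission; return the whole batch once it is full, else None."""
        self._pending.append((server_id, list(feature_vectors)))
        if len(self._pending) < self.batch_size:
            return None
        batch, self._pending = self._pending, []   # hand over and reset
        return batch
```

A returned batch would then be passed to the second-level judgment.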
In the second-level judgment, the directory server combines the accuracy matrix of the mail servers with the repetition count matrix of the feature vectors to produce the confidence matrix of the feature vectors. The accuracy matrix is:
$$S^{T} = \begin{bmatrix} s_1 & s_2 & \cdots & s_n \end{bmatrix}$$
where s_i is the accuracy with which server i identifies spam. The repetition count matrix is:
$$A_{m \times n} = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_m \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix}$$
In A_{m×n}, m is the number of distinct feature vectors, n is the number of mail servers, R_p is the repetition count row of feature vector p, and r_{pq} is the repetition count of feature vector p on mail server q. The computation for the second-level judgment is:
$$C = A_{m \times n} \times S = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_m \end{bmatrix} \times S = \begin{bmatrix} R_1 \times S \\ R_2 \times S \\ \vdots \\ R_m \times S \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} r_{1i} \times s_i \\ \sum_{i=1}^{n} r_{2i} \times s_i \\ \vdots \\ \sum_{i=1}^{n} r_{mi} \times s_i \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix}$$
where C is the confidence matrix, according to which the directory server carries out the second screening of the features.
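The confidence computation is a matrix-vector product, sketched here in plain Python. The `min_confidence` cut-off is an assumption: the text says that low-confidence vectors are eliminated but gives no threshold, and the function names are hypothetical.

```python
def confidence_scores(repeat_matrix, accuracy):
    """Compute c_p = sum over q of r_pq * s_q for every feature vector p.

    repeat_matrix: m rows (feature vectors) by n columns (mail servers)
    accuracy:      n per-server accuracy values s_1 .. s_n
    """
    if any(len(row) != len(accuracy) for row in repeat_matrix):
        raise ValueError("each row needs one entry per mail server")
    return [sum(r * s for r, s in zip(row, accuracy)) for row in repeat_matrix]

def second_level_judgment(features, repeat_matrix, accuracy, min_confidence):
    """Keep the features whose confidence reaches the assumed cut-off."""
    scores = confidence_scores(repeat_matrix, accuracy)
    return [f for f, c in zip(features, scores) if c >= min_confidence]
```

Weighting repetition counts by server accuracy means a feature reported a few times by a reliable server can outrank one reported more often by an unreliable server.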
The filtering rules consist of two parts, the fingerprints of mail content and a blacklist, both of which can be extracted from the feature vectors.
The updated filtering rules are read from the rule base and then published to each mail server, so that the mail servers share and update their filtering rules and collaborative spam prevention is achieved.
When a mail server receives a new e-mail, the feature vector of the message is extracted first. The system buffer is then searched for a matching feature vector; if one exists, the message is judged to be spam. Otherwise, the sending host information of the message is checked against the blacklist, and if it matches, the message is judged to be spam. If neither a matching fingerprint nor a blacklist entry is found, the system places the message in the mail queue, subject to a preset maximum time that a message may stay in the queue, and re-judges it at fixed intervals using the same procedure. If no matching filtering rule has appeared in the system within the maximum retention time, the message is judged to be legitimate and delivered to the corresponding account.
As shown in Fig. 2, deploying the present invention requires a network made up of a directory server and a number of mail servers. The main modules of a mail server and their functions are as follows:
(1) Sample collection module. It consists of three submodules: sample collection, feature extraction and first-level judgment. The sample collection submodule compiles statistics on the history of the mail accounts, selects a number of e-mail accounts on the server as honeypot accounts, and uses the bulk-sending behaviour of spam to collect suspected spam samples periodically from the honeypot account set. The feature extraction submodule generates lightweight feature vectors, mainly by extracting the header information and content fingerprint of suspected spam (rather than analysing the mail content itself). The first-level judgment submodule makes the first judgment and screening of the extracted features according to the repetition count of each sample in the honeypot account set, and then submits them to the directory server.
(2) Rule update module. It receives the latest filtering rules published by the directory server and stores them in the local rule base.
(3) Spam filtering module. It takes a message from the mail buffer queue, extracts its feature vector, and searches the local rule base for a matching rule. If a match is found, the message is judged to be spam. Otherwise, a cache time is set and the message is stored in the user buffer; if no matching rule has appeared within the cache time, the message is judged to be legitimate.
The main modules of the directory server and their functions are as follows:
(1) Second-level judgment module. It makes a second judgment on the feature vectors according to the number of times each feature vector was submitted by each mail server (its repetition count) and the accuracy with which each mail server judges spam, and screens out a set of high-accuracy feature vectors.
(2) Rule generation module. It reorganizes the feature vector set, generates filtering rules, and stores them in the rule base of the directory server.
(3) Rule release module. At a set interval, it finds the updated rules in the rule base quickly and publishes them to each mail server, so that the local rule base of each mail server is updated quickly and in real time.
A prototype system comprising the functional modules described above has been developed on the basis of the present invention. The results of its implementation show that the invention can improve spam filtering accuracy while intercepting spam on a large scale; the system also has rapid immunity to new types of spam and adapts to different types of spam, being able to filter, for example, spam whose content is a Web page or a picture.

Claims (8)

1. A collaborative spam prevention method, characterized in that the method is as follows:
Step 1: The mail server reads the relevant information of the e-mail accounts and scores each account by applying the honeypot account scoring formula to this information. It then writes the resulting scores, in descending order, to the system's account database, updating the account score table, and, using the honeypot selection algorithm, selects a number of the system's e-mail accounts as honeypot accounts to form the honeypot set;
Step 2: Based on the behavioural characteristics of spam, mail samples are collected periodically from the honeypot account set and their features are extracted to form feature vectors, which are used to represent the sample set;
Step 3: The repetition count of each sample feature vector in the honeypot set, i.e. the number of times the sample occurs, is used to make a first judgment (the first-level judgment), which gives an initial improvement in sample accuracy;
Step 4: The mail server submits the feature vectors that pass the first-level judgment to the directory server. The directory server makes a second judgment on them (the second-level judgment), screens out the feature vectors with higher accuracy, and generates filtering rules;
Step 5: The directory server publishes the newly generated filtering rules to the rule base of each mail server, which is updated accordingly. Each mail server uses the updated rules to filter spam when it receives new e-mail;
The honeypot set is generated as follows. The initial honeypot set is empty. The system then reads the account score table from the database, preferentially selects the accounts with higher scores and adds them to the honeypot set. Each time a honeypot account is added, the amount of spam in the set is counted. Because the amount of spam on a server is limited, the amount of spam collected from the set tends towards a fixed value as the set grows. When the increase in spam after adding a honeypot account falls below a set threshold, the final honeypot set is determined;
When selecting honeypot accounts, the scoring formula for candidate honeypots is:
$$V(t,\Delta t) = \left( \lambda_1 \cdot \frac{S_1(t-\Delta t)}{H_1(t-\Delta t)} + (1-\lambda_1) \cdot \frac{S_2(\Delta t)}{H_2(\Delta t)} \right) \cdot \lambda_2 + S_2(\Delta t) \cdot (1-\lambda_2)$$
where:
V: the score of an account in the system, indicating how likely the account is to be chosen as a honeypot; the larger V is, the more likely the account is to be chosen as a honeypot account, and vice versa;
t: time variable, the moment at which the algorithm is executed;
Δt: time interval variable, the interval between two successive executions of the algorithm;
S_1(t - Δt): the historical total of spam the account had received up to time (t - Δt);
S_2(Δt): the total spam the account received in the most recent period Δt;
H_1(t - Δt): the historical total of legitimate mail the account had received up to time (t - Δt);
H_2(Δt): the total legitimate mail the account received in the most recent period Δt;
λ_1: a weight with a value between 0 and 1, adjusted according to the actual system;
λ_2: a weight with a value between 0 and 1, adjusted according to the actual system.
2. The collaborative spam prevention method according to claim 1, characterized in that: mail sample collection exploits the bulk-sending behaviour of spam, whereby one spam message often appears in several honeypot accounts at the same time; using this feature for sample collection requires counting the distribution of a message across the honeypot set, i.e. the number of accounts in the set that received the message; if the number of accounts in the set that received the same message exceeds a specified threshold, the message is classified as "suspected" spam and collected.
3. The method for preventing collaboration type junk mail according to claim 1, characterized in that: extracting the features of a sample means performing feature extraction on the spam samples collected from the honeypot account set and representing each sample as a feature vector, to facilitate subsequent storage and computation; a feature extraction method based on the mail header and a fingerprint of the mail content, rather than on the mail content itself, is used to generate a lightweight feature vector;
The feature vector of a sample has the following form:
F=<SA,SIP,FP>
The meaning of each component of the feature vector F is shown in the following table:
Component    Meaning
SA           Sender's e-mail address, i.e. the Return-Path part of the mail header information
SIP          Source IP of the mail, i.e. the first IP address in the last Received field of the mail header information
FP           Fingerprint information of the mail content
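A rough sketch of building F = <SA, SIP, FP> from a raw message, using Python's standard `email` parser. The fingerprint algorithm is not specified in the claim; a SHA-256 digest over lower-cased, whitespace-normalised body text is assumed here purely for illustration, and only IPv4 addresses are matched. The function name `extract_feature_vector` is hypothetical.

```python
import hashlib
import re
from email import message_from_string

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_feature_vector(raw_mail):
    """Return (SA, SIP, FP) for one raw RFC 822 style message string."""
    msg = message_from_string(raw_mail)
    # SA: sender address taken from the Return-Path header.
    sa = (msg.get("Return-Path") or "").strip().strip("<>")
    # SIP: first IP address found in the last Received header field.
    received = msg.get_all("Received") or []
    match = IPV4.search(received[-1]) if received else None
    sip = match.group(0) if match else None
    # FP: assumed fingerprint, a SHA-256 digest of the normalised body text.
    body = "".join(part.get_payload() for part in msg.walk()
                   if not part.is_multipart())
    normalised = " ".join(body.split()).lower()
    fp = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    return (sa, sip, fp)
```

Because the vector holds only two header values and a digest, it stays small regardless of message size, which is the "lightweight" property the claim asks for.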
4. The method for preventing collaboration type junk mail according to claim 1, characterized in that: the first-level judgment counts the number of repetitions of each feature vector in the set and keeps a feature vector if its repetition count is greater than a preset threshold; at the same time, feature vectors that already exist in the feature database are deleted from the set; the resulting feature vector set is then written to the system's feature database to complete the update, and the mail server submits the feature vectors that passed the first-level judgment to the directory server;
The calculation for the first-level judgment is:
Figure FSB00000341698800021
where C is the confidence matrix, and the directory server screens the features a second time according to the confidence matrix;
The mail server submits the generated feature vector set to the directory server through a unified interface; the directory server sets aside a dedicated buffer for storing feature vector sets awaiting processing; when the directory server receives a feature vector set sent by a mail server, it temporarily stores the set in the system buffer; once the number of received feature vector sets reaches a certain amount, the second-level judgment is performed on them;
where m is the number of distinct feature vectors, n is the number of mail servers, R_p is the repetition-count matrix of feature vector p, r_pq is the repetition count of feature vector p on mail server q, and S is the accuracy matrix.
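The counting and de-duplication steps of the first-level judgment can be illustrated as below. This sketch covers only the procedure described in words in claim 4, not the formula shown as a figure; the feature database is modelled as a plain Python set, and the names `first_level_judgment`, `feature_db` and `repeat_threshold` are hypothetical.

```python
from collections import Counter


def first_level_judgment(vectors, feature_db, repeat_threshold):
    """Keep feature vectors repeated more than repeat_threshold times.

    vectors: list of hashable feature vectors, e.g. (SA, SIP, FP) tuples.
    feature_db: set of vectors already known locally; updated in place.
    Returns a dict of newly kept vector -> repetition count, i.e. what the
    mail server would submit to the directory server.
    """
    counts = Counter(vectors)
    kept = {v: n for v, n in counts.items()
            if n > repeat_threshold and v not in feature_db}
    feature_db.update(kept)  # write the new vectors to the feature database
    return kept
```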
5. The method for preventing collaboration type junk mail according to claim 1, characterized in that: the second-level judgment makes a joint decision based on each mail server's accuracy in identifying spam and on the repetition count of each feature vector on each mail server, computes the confidence of each spam feature, and eliminates the feature vectors with lower confidence;
In the second-level judgment, the directory server performs a computation on the accuracy matrix of the mail servers and the repetition-count matrix of the feature vectors, thereby generating the confidence matrix of the feature vectors, where the accuracy matrix is:
S^T = [s_1 s_2 … s_n]
s_i denotes the accuracy with which server i identifies spam; the repetition-count matrix is:
Figure FSB00000341698800031
In A_{m×n}, m is the number of distinct feature vectors, n is the number of mail servers, R_p is the repetition-count matrix of feature vector p, and r_pq is the repetition count of feature vector p on mail server q.
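One plausible reading of the second-level judgment, offered only as an assumption since the exact formula appears as a figure, is that the confidence matrix is the product of the repetition-count matrix and the accuracy matrix, C = A_{m×n} · S, so that each feature vector's confidence is its repetition counts weighted by server accuracy. The sketch below follows that assumed reading; the elimination rule (a fixed `confidence_threshold`) and the function name are likewise hypothetical.

```python
def second_level_judgment(A, S, confidence_threshold):
    """Compute confidences as C = A * S (assumed) and pick the survivors.

    A: m x n list of lists, A[p][q] = repetitions of vector p on server q.
    S: list of n accuracies, S[q] = accuracy of mail server q.
    Returns (C, kept), where kept lists the row indices of feature vectors
    whose confidence reaches confidence_threshold.
    """
    # Row p of A dotted with S gives the confidence of feature vector p.
    C = [sum(r * s for r, s in zip(row, S)) for row in A]
    kept = [p for p, c in enumerate(C) if c >= confidence_threshold]
    return C, kept
```

Under this reading, a feature reported a few times by an accurate server can outweigh one reported equally often by an unreliable server.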
6. The method for preventing collaboration type junk mail according to claim 1, characterized in that: the filtering rule described in the fourth step comprises two parts, the fingerprint information of the mail content and a blacklist, both of which can be extracted from the feature vector.
7. The method for preventing collaboration type junk mail according to claim 1, characterized in that: the update process in the fifth step reads the updated filtering rules from the rule base and then publishes them to each mail server, so that the filtering rules of the mail servers are shared and updated and the purpose of cooperative spam prevention is achieved.
8. The method for preventing collaboration type junk mail according to claim 1, characterized in that: when spam filtering is performed on a new e-mail in the fifth step, the feature vector of the message is extracted first; the system buffer is then searched for a matching feature vector; if one exists, the message is judged to be spam; otherwise the sending host information of the message is checked against the blacklist, and if it matches the blacklist the message is judged to be spam; when neither a matching fingerprint nor a matching blacklist entry is found, the system places the message in the mail queue, subject to a preset maximum time that a message may stay in the queue, and re-evaluates it at fixed intervals according to the above procedure; if no matching filtering rule has appeared in the system within the maximum residence time, the message is judged to be legitimate mail and is delivered to the corresponding account.
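The filtering flow of claim 8 can be sketched as a retry loop. Waiting in the mail queue is simulated by advancing a counter instead of sleeping, and the current rules are obtained through a callable so that rule updates published between checks are picked up; `filter_mail`, `get_rules`, `max_wait` and `interval` are hypothetical names.

```python
def filter_mail(fingerprint, sender_host, get_rules, max_wait, interval):
    """Classify one message as "spam" or "legitimate".

    get_rules: callable returning (fingerprints, blacklist), two sets that
               reflect the mail server's current local rule base.
    max_wait:  maximum time the message may stay in the mail queue.
    interval:  fixed time between two re-evaluations.
    """
    waited = 0
    while True:
        fingerprints, blacklist = get_rules()
        # Check the content fingerprint first, then the sending host.
        if fingerprint in fingerprints or sender_host in blacklist:
            return "spam"
        # No rule matched within the maximum residence time: deliver it.
        if waited >= max_wait:
            return "legitimate"
        waited += interval  # stands in for waiting in the mail queue
```

Holding unmatched mail for a bounded time gives rules derived from honeypot samples a chance to arrive before delivery, at the cost of a delivery delay of at most `max_wait`.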
CN2009100286953A 2009-01-05 2009-01-05 Method for preventing collaboration type junk mail Expired - Fee Related CN101494546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100286953A CN101494546B (en) 2009-01-05 2009-01-05 Method for preventing collaboration type junk mail

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100286953A CN101494546B (en) 2009-01-05 2009-01-05 Method for preventing collaboration type junk mail

Publications (2)

Publication Number Publication Date
CN101494546A CN101494546A (en) 2009-07-29
CN101494546B true CN101494546B (en) 2011-04-20

Family

ID=40924966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100286953A Expired - Fee Related CN101494546B (en) 2009-01-05 2009-01-05 Method for preventing collaboration type junk mail

Country Status (1)

Country Link
CN (1) CN101494546B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101778055B (en) * 2009-12-31 2013-03-13 卓望数码技术(深圳)有限公司 Message processing method and network entity
CN102419777B (en) * 2012-01-10 2013-10-02 凤凰在线(北京)信息技术有限公司 System and method for filtering internet image advertisements
CN103078753B (en) * 2012-12-27 2016-07-13 华为技术有限公司 The processing method of a kind of mail, device and system
CN107294834A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for recognizing spam
CN107453973B (en) * 2016-05-31 2021-04-13 阿里巴巴集团控股有限公司 Method and device for discriminating identity characteristics of e-mail sender
CN107171944B (en) * 2017-06-27 2020-06-16 北京二六三企业通信有限公司 Junk mail identification method and device
CN107888484A (en) * 2017-11-29 2018-04-06 北京明朝万达科技股份有限公司 A kind of email processing method and system
CN110781429A (en) * 2019-09-24 2020-02-11 支付宝(杭州)信息技术有限公司 Internet data detection method, device, equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1434390A (en) * 2003-02-28 2003-08-06 上海蓝飞通信设备有限公司 Method for preventing rubbish E-mail
CN1564551A (en) * 2004-03-16 2005-01-12 张晴 Method of carrying out preventing of refuse postal matter
CN1578357A (en) * 2003-07-15 2005-02-09 乐金电子(中国)研究开发中心有限公司 Garbage mail intercepting method for mobile communication terminal
CN1614607A (en) * 2004-11-25 2005-05-11 中国科学院计算技术研究所 Filtering method and system for e-mail refuse

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林加镇 (Lin Jiazhen) et al. A new spam sample collection method. Journal of Southeast University. 2008, Vol. 38 (No. 2), 244-248. *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110420

Termination date: 20140105