CN108763449A

CN108763449A - Method for generating Chinese keyword rules for spam filtering

Info

Publication number: CN108763449A
Application number: CN201810521174.0A
Authority: CN
Inventors: 张凌; 张启华; 张晶; 徐傲雪; 黄康泉
Original assignee: South China University of Technology SCUT; CERNET Corp
Current assignee: South China University of Technology SCUT; CERNET Corp
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2018-11-06

Abstract

The invention discloses a method for generating Chinese keyword rules for spam filtering. The method comprises four main steps: obtaining keyword candidate words from a mail set, extracting features to select keywords, collecting the triggering results of the keyword rules, and assigning scores to the keyword rules. Compared with existing techniques, the proposed method improves keyword feature extraction by combining term frequency with document frequency, which reduces the influence of common words, and it computes rule scores with a neural network algorithm, which reduces the training cost relative to a genetic algorithm. The invention addresses the lack of timeliness of existing Chinese keyword rules, and it can generate the keyword rules that best fit a specific user group from that group's own definition of spam and the mail data set it provides.

Description

Method for generating Chinese keyword rules for spam filtering

Technical field

The present invention relates to the technical field of Internet security, and in particular to a method for generating Chinese keyword rules for spam filtering.

Background

With the rapid development of the Internet, and of the mobile Internet in particular, means of network communication have become ever richer, yet email, the most widely used Internet application, remains irreplaceable. The spam that floods the network today wastes large amounts of network resources and increases the time users spend handling mail, and the spread of virus-carrying spam can directly cause huge economic losses. After decades of research worldwide, mature and varied anti-spam technologies have accumulated. They mainly include: techniques that verify the sender's identity based on how mail is sent, such as blacklists, whitelists, SPF checks, and honeypots; filtering techniques based on user behavior, such as concurrency control and frequency control; and content-based filtering, which combines machine learning with statistics and is realized by two classes of methods, probability-based and rule-based. Among rule-based approaches, the open-source spam filtering solution SpamAssassin performs well. One class of SpamAssassin rules is the keyword rule, which scans the mail header and mail body and checks whether they contain words common in spam; each keyword rule is assigned a specific weight score. The SpamAssassin project maintains only English keyword rules, so it cannot check whether Chinese mail contains expressions common in spam. In 2004, CCERT developed a Chinese rule set using term-frequency statistics and a genetic algorithm, but the set has not been updated since 2006. Because the keywords common in spam change over time, that rule set is no longer timely. Moreover, when extracting keyword features, CCERT used term-frequency statistics alone and chose the words with the highest term frequency in the spam set. This approach treats words that are common in both spam and normal mail as spam keywords, which is clearly unreasonable. CCERT also computed rule scores with the genetic algorithm provided by older versions of SpamAssassin; since version 3.4, SpamAssassin has switched to a neural network algorithm, which effectively reduces training time compared with the genetic algorithm. In addition, different users often apply different standards when judging what counts as spam. In view of these problems, a solution that generates Chinese keyword rules from a specific mail set is of great significance.

Summary of the invention

The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a method for generating Chinese keyword rules for spam filtering. From a specific mail data set provided by the user, the method automatically generates the Chinese keyword rules that best meet the user's needs, for use in rule-based spam filtering schemes.

To achieve the above object, the technical solution provided by the present invention is a method for generating Chinese keyword rules for spam filtering. The method preprocesses a given mail data set to obtain all words in the mail header and mail body of each mail as keyword candidate words, and selects keywords with a feature extraction method that combines term frequency and document frequency. It then filters the same mail data set with the resulting keyword rules to obtain how the rules are triggered by spam and by normal mail, and uses this triggering data as the input of a neural network algorithm. The neural network is trained with stochastic gradient descent until the filtering performance converges, and the trained weights are converted into rule scores. The resulting rules can be applied in rule-based mail filtering solutions. The method specifically comprises the following steps:

1) preprocess the mail data set by mail screening, mail parsing, and Chinese word segmentation to obtain the keyword candidate word set;

2) count the term frequency and document frequency of every word in the candidate word set, and select keywords from the candidate word set by feature extraction that compares term frequency first and document frequency second;

3) collect the keyword triggering results for each mail in the mail data set, and format the triggering data;

4) assign scores to the keyword rules with a neural network algorithm according to the above keyword triggering results.

In step 1), mail screening means removing the purely English mail from the mail data set; mail parsing means parsing the mail content according to RFC822 and the MIME protocol, splitting it into its parts, and selecting the mail header and mail body; and Chinese word segmentation means segmenting the text content of the mail header and mail body with a Chinese word segmentation tool.

In step 2), feature selection combines term frequency and document frequency to determine the keywords, and comprises the following steps:

2.1) count term frequency and document frequency, where term frequency is the number of times a word occurs in a document and document frequency is the number of documents in which a candidate word occurs;

2.2) select the N words with the highest term frequency in spam;

2.3) filter the keywords according to the formula spam(wi) / (spam(wi) + ham(wi)) > T%; every wi that satisfies the formula is a keyword, where wi is a word in the set of N words with the highest term frequency, spam(wi) is the number of spam emails containing the word wi, ham(wi) is the number of normal emails containing the word wi, and T% is a configured threshold.
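Steps 2.1) to 2.3) can be illustrated with a minimal Python sketch. It assumes each email has already been segmented into a list of tokens and that term frequency is summed over the whole spam set; the function name `select_keywords` and its parameter names are hypothetical and do not come from the patent.

```python
from collections import Counter

def select_keywords(spam_docs, ham_docs, top_n, threshold):
    """Pick keywords from tokenized emails (one token list per email)."""
    # Term frequency: total occurrences of each word across all spam emails.
    spam_tf = Counter(tok for doc in spam_docs for tok in doc)
    # Document frequency: number of emails containing the word at least once.
    spam_df = Counter(tok for doc in spam_docs for tok in set(doc))
    ham_df = Counter(tok for doc in ham_docs for tok in set(doc))

    keywords = []
    # Step 2.2: keep only the top_n words by term frequency in spam.
    for word, _ in spam_tf.most_common(top_n):
        # Step 2.3: spam(wi) / (spam(wi) + ham(wi)) must exceed the threshold.
        ratio = spam_df[word] / (spam_df[word] + ham_df[word])
        if ratio > threshold:
            keywords.append(word)
    return keywords

spam_docs = [["免费", "发票", "你好"], ["免费", "你好"], ["发票", "免费"]]
ham_docs = [["你好", "会议"], ["你好"], ["会议", "发票"]]
selected = select_keywords(spam_docs, ham_docs, top_n=3, threshold=0.7)
```

In this toy data, a word such as "你好" is frequent in spam but equally common in normal mail, so the document-frequency ratio removes it even though term frequency alone would have kept it.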

In step 3), the open-source tool SpamAssassin is used to collect the keyword triggering results for each mail in the mail data set, and the triggering data is formatted. This comprises the following steps:

3.1) disable all rules built into SpamAssassin and disable the Bayesian algorithm to eliminate the influence of other rules, then add the keyword rules generated in step 2);

3.2) use the mass-check script provided by SpamAssassin to run SpamAssassin filtering on every mail in the training set, and record in a log all rules triggered by each mail;

3.3) post-process the log file to convert the results into a structured form.
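As an illustration of step 3.1), the sketch below renders selected keywords as SpamAssassin rule definitions using the `body`, `header`, and `score` configuration directives. The rule-naming scheme (`CN_BODY_0001` and so on), the choice of the Subject header for header rules, and the placeholder initial score are assumptions made for this example; the patent does not specify them, and whether the patterns match in practice depends on how the SpamAssassin installation handles character sets.

```python
import re

def make_rules(keywords, kind, initial_score=1.0):
    """Render keywords as SpamAssassin rule lines; kind is 'body' or 'header'."""
    lines = []
    for i, word in enumerate(keywords, start=1):
        # Hypothetical naming scheme: upper-case names with a running index.
        name = f"CN_{kind.upper()}_{i:04d}"
        # Escape regex metacharacters, plus the '/' used as the delimiter.
        pattern = re.escape(word).replace("/", r"\/")
        if kind == "header":
            # Assumption: header keywords are matched against the Subject.
            lines.append(f"header {name} Subject =~ /{pattern}/")
        else:
            lines.append(f"body {name} /{pattern}/")
        # Placeholder score; the real scores come from training in step 4).
        lines.append(f"score {name} {initial_score}")
    return "\n".join(lines)

body_rules = make_rules(["发票"], "body")
header_rules = make_rules(["免费"], "header")
```

The generated text would be written to a rules file that is loaded in place of the built-in rules before mass-check is run.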

In step 4), a neural network algorithm is used to assign scores to the keyword rules generated in step 2). This comprises the following steps:

4.1) first replicate the non-spam mail redundantly; the number of copies of a non-spam mail is given by 1 + (number_of_test_hit) * ham_preference, where ham_preference is an input parameter that defaults to 2.0 and number_of_test_hit is the number of rules the mail triggers;

4.2) randomly assign each rule a weight within a specific range, the range being determined by the number of mails that trigger the rule;

4.3) train with the neural network algorithm and stop after num_epochs iterations, where num_epochs is the number of neural network iterations; the weight_decay and bias parameters are specified for each round of iteration, where weight_decay is the rate at which the weights decay in one iteration and bias is an offset used to smooth statistical anomalies;

4.4) delete the rules whose trained score is 0 to obtain the final generated rules.
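The training procedure in steps 4.1) to 4.3) can be sketched as a single-layer network trained by stochastic gradient descent. This is an illustrative sketch only, not SpamAssassin's perceptron implementation: the sigmoid activation, the fixed initial weight range, the omission of the bias parameter, and the treatment of weight_decay as a per-epoch multiplicative factor (so that 1.0 means no decay) are all assumptions, and `train_scores` is a hypothetical name.

```python
import math
import random

def train_scores(messages, num_rules, num_epochs=200, learning_rate=0.1,
                 ham_preference=2.0, weight_decay=1.0, seed=0):
    """messages: list of (set of triggered rule indices, is_spam) pairs."""
    rng = random.Random(seed)

    # Step 4.1: replicate each non-spam mail 1 + hits * ham_preference times.
    samples = []
    for hits, is_spam in messages:
        copies = 1 if is_spam else 1 + int(len(hits) * ham_preference)
        samples.extend([(tuple(hits), 1.0 if is_spam else 0.0)] * copies)

    # Step 4.2 (simplified assumption): small random weights in a fixed range.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(num_rules)]

    # Step 4.3: stochastic gradient descent for a fixed number of epochs.
    for _ in range(num_epochs):
        rng.shuffle(samples)
        for hits, target in samples:
            activation = sum(weights[j] for j in hits)
            predicted = 1.0 / (1.0 + math.exp(-activation))
            error = target - predicted
            # Only the rules that fired on this mail receive an update.
            for j in hits:
                weights[j] += learning_rate * error
        weights = [w * weight_decay for w in weights]
    return weights

messages = [({0}, True), ({0, 1}, True), ({0}, True),
            ({2}, False), ({1, 2}, False)]
weights = train_scores(messages, num_rules=3)
```

A rule that fires only on spam ends with a positive weight and a rule that fires only on normal mail ends with a negative one; step 4.4) would then drop any rule whose resulting score is 0.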

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The present invention generates Chinese keyword rules from a specific mail set, which addresses the lack of timeliness of existing Chinese keyword rules.

2. The present invention can generate the keyword rules that best fit a specific user group according to that group's own definition of spam.

3. The present invention improves keyword feature extraction: combining term frequency with document frequency removes the influence of some common words.

4. The present invention computes rule scores with a neural network algorithm, which reduces the training cost compared with the traditional genetic algorithm.

Description of the drawings

Fig. 1 is the data flow diagram of a specific implementation of the method of the present invention.

Fig. 2 is the flow chart of the keyword selection method of the present invention.

Fig. 3 is the flow chart of the method for determining keyword rule scores in the present invention.

Detailed description

The present invention is further explained in the light of specific embodiments.

This embodiment evaluates the method of the present invention on the public Chinese email data set SEWMTest Corpus2011, using the false filtering rate, the missed filtering rate, the logistic average misfiltering rate, and the metric <hm%=0.1, sm%> as evaluation indicators. The data flow of this example is shown in Fig. 1. The detailed process of the method for generating Chinese keyword rules for spam filtering is as follows:

Step 1: remove the English mail from the above SEWMTest Corpus2011 data set and keep the Chinese mail. Because the goal is to extract Chinese rules and the SEWMTest Corpus2011 mail set contains English mail, non-Chinese mail is first removed from the training set. The criterion is as follows: decode the mail to obtain the Unicode code points of its characters, then check whether the mail contains any character in the range U+4E00 to U+9FA5 inclusive. If the mail header or mail body contains a character in this range, the mail is considered Chinese. Only such Chinese mail is retained, which yields 4740 spam emails and 5678 normal emails.
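The Chinese-mail test described in step 1 can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the header and body have already been decoded to Unicode strings.

```python
def contains_chinese(text):
    """Return True if text has any character in U+4E00..U+9FA5 inclusive."""
    return any("\u4e00" <= ch <= "\u9fa5" for ch in text)

def is_chinese_mail(header_text, body_text):
    # A mail counts as Chinese if either its header or its body qualifies.
    return contains_chinese(header_text) or contains_chinese(body_text)
```

For example, `is_chinese_mail("Re: meeting", "你好")` is true because the body contains characters in the range, while a mail whose header and body are entirely ASCII is rejected.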

Step 2: decode all mail in the mail set obtained in step 1. Keywords come mainly from the mail subject and the mail body, so each mail is decoded to obtain the Chinese text of its header and body.
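One possible way to perform this decoding is with the `email` package in the Python standard library, which handles RFC 2047 encoded headers and MIME transfer encodings. The patent does not name a parsing tool, so this choice, the function name `extract_subject_and_body`, and the preference for the plain-text part are assumptions of the sketch.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def extract_subject_and_body(raw_bytes):
    """Decode a raw mail into (subject, body text) as Unicode strings."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # With the default policy, encoded-word headers are decoded on access.
    subject = str(msg["Subject"] or "")
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part is not None else ""
    return subject, body

# Usage: build a small message in memory and parse it back.
demo = EmailMessage()
demo["Subject"] = "免费发票"
demo.set_content("你好")
subject, body = extract_subject_and_body(demo.as_bytes())
```

The returned strings can then be passed to the Chinese-character check and to the word segmentation step.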

Step 3: process the Chinese text of the mail headers and mail bodies produced in step 2 with a Chinese word segmentation algorithm to generate the feature candidate words. This example uses the jieba Chinese word segmentation library, a popular open-source library that supports an accurate mode and a search engine mode. The accurate mode tries to produce the most accurate segmentation, while the search engine mode splits sentences as finely as possible to obtain fine-grained tokens. This example uses the library's accurate mode.

Step 4: for the feature candidate words obtained in step 3, count the term frequency and the document frequency, then extract keywords with the method described above, which compares term frequency first and document frequency second (T% is set to 70%, an empirical value). The CCERT rules obtained from the network comprise 332 mail header rules and 154 mail body rules. Because this example is to be compared against them, feature extraction selects 330 mail header rules and 150 mail body rules.

Step 5: remove all SpamAssassin rules and disable the Bayesian algorithm to eliminate the influence of other rules. To avoid generating excessively high scores, the threshold for classifying a mail as spam is set to 1.0.

Step 6: divide the mail set obtained in step 1 into a training set and a test set, where the training set accounts for 70% of the Chinese mail set and the test set takes the remaining 30%. Use the mass-check script provided by SpamAssassin to apply the keyword rules generated in step 4 to the training set, and obtain how these rules are triggered by spam and by non-spam mail.
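The 70/30 division can be sketched as a seeded random split. The patent does not say how mails are assigned to each side, so the shuffling, the fixed seed, and the name `split_dataset` are assumptions of this example.

```python
import random

def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # round() avoids losing an item to floating-point truncation.
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(range(10))
```

Applying the split separately to the spam and non-spam lists would keep the class proportions the same in both sets, which is one reasonable design choice when the two classes differ in size.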

Step 7: use the rule triggering results obtained in step 6 as the input of the neural network, and obtain the rule scores by iterating with stochastic gradient descent. The difficulty of this step lies in determining four parameters. This example sets ham_preference=2.0 (the default) and weight_decay=1.0 (the default). Because the threshold was lowered in step 5, learning_rate is reduced to 0.02 (the default is 2.0; 0.02 is an empirical value that works relatively well). num_epochs determines the number of neural network iterations. The perceptron provides no logic for detecting convergence and stopping automatically, so the number of iterations is set by the user. To determine it, logic was added to SpamAssassin's perceptron implementation to print, every 10 epochs, the number of misclassified mails (non-spam filtered as spam, or spam filtered as non-spam). Observation shows that the method has essentially converged at num_epochs=3000, so the number of iterations is set to 3000.

Step 8: delete the rules whose score from the training in step 7 is 0, which yields the final automatically generated rules.

Step 9: evaluate the final rules generated in step 8.

Steps 1 to 4 select keywords from the mail data set, and the flow chart of this part is shown in Fig. 2. Steps 4 to 8 determine the scores of the keyword rules, and the flow chart of this part is shown in Fig. 3.

Finally, step 9 compares the results of the above example with the results of filtering the above SEWMTestCorpus2011 mail set using the CCERT rules. The evaluation indicators are defined as follows:

False filtering rate hm%: the proportion of all non-spam mail that is misidentified as spam.

Missed filtering rate sm%: the proportion of all spam that is misidentified as non-spam mail.

Logistic average misfiltering rate lam%: the geometric mean of the misfiltering rates of non-spam mail and spam.

Metric <hm%=0.1, sm%>: the value of sm% when hm%=0.1.
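The first three indicators can be computed from four counts, as in the sketch below. For lam%, the sketch assumes the logistic-average formula commonly used in spam-filter evaluation, lam = logit^-1((logit(hm) + logit(sm)) / 2), which amounts to a geometric mean taken over the odds of the two error rates; the patent text gives only the verbal definition above, so this formula and the function name are assumptions.

```python
import math

def evaluation_metrics(ham_total, ham_as_spam, spam_total, spam_as_ham):
    """Return (hm%, sm%, lam%); requires both error rates strictly in (0, 1)."""
    hm = ham_as_spam / ham_total    # non-spam misidentified as spam
    sm = spam_as_ham / spam_total   # spam misidentified as non-spam

    def logit(p):
        return math.log(p / (1.0 - p))

    # Assumed formula: inverse logit of the mean of the two logits.
    mean_logit = (logit(hm) + logit(sm)) / 2.0
    lam = 1.0 / (1.0 + math.exp(-mean_logit))
    return hm * 100.0, sm * 100.0, lam * 100.0

hm_pct, sm_pct, lam_pct = evaluation_metrics(1000, 100, 1000, 100)
```

When the two error rates are equal, lam% equals both of them; otherwise it always lies between hm% and sm%.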

The comparison test results are as follows:

The above experiments show that the Chinese rule set generated in this example performs better than the rules provided by CCERT, mainly because the training set and the test set come from the same data set and are therefore more closely correlated. This indicates that rules trained on a user's own existing mail perform better, which demonstrates the value of the method of the present invention and makes it worth adopting.

The embodiment described above is only a preferred embodiment of the present invention and is not intended to limit its scope of implementation; any change made according to the form and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for generating Chinese keyword rules for spam filtering, characterized in that: the method preprocesses a given mail data set to obtain all words in the mail header and mail body of each mail in the data set as keyword candidate words, selects keywords with a feature extraction method that combines term frequency and document frequency, then filters the above mail data set with the keyword rules to obtain how the keyword rules are triggered by spam and by normal mail, uses this triggering data as the input of a neural network algorithm, trains the neural network with stochastic gradient descent until the filtering performance converges, and converts the trained weights into rule scores, the resulting rules being applicable in rule-based mail filtering solutions; the method specifically comprises the following steps:

1) preprocess the mail data set by mail screening, mail parsing, and Chinese word segmentation to obtain the keyword candidate word set;

2) count the term frequency and document frequency of every word in the candidate word set, and select keywords from the candidate word set by feature extraction that compares term frequency first and document frequency second;

2. The method for generating Chinese keyword rules for spam filtering according to claim 1, characterized in that: in step 1), mail screening means removing the purely English mail from the mail data set; mail parsing means parsing the mail content according to RFC822 and the MIME protocol, splitting it into its parts, and selecting the mail header and mail body; and Chinese word segmentation means segmenting the text content of the mail header and mail body with a Chinese word segmentation tool.

3. The method for generating Chinese keyword rules for spam filtering according to claim 1, characterized in that: in step 2), feature selection combines term frequency and document frequency to determine the keywords, and comprises the following steps:

2.1) count term frequency and document frequency, where term frequency is the number of times a word occurs in a document and document frequency is the number of documents in which a candidate word occurs;

2.2) select the N words with the highest term frequency in spam;

2.3) filter the keywords according to the formula spam(wi) / (spam(wi) + ham(wi)) > T%; every wi that satisfies the formula is a keyword, where wi is a word in the set of N words with the highest term frequency, spam(wi) is the number of spam emails containing the word wi, ham(wi) is the number of normal emails containing the word wi, and T% is a configured threshold.

4. The method for generating Chinese keyword rules for spam filtering according to claim 1, characterized in that: in step 3), the open-source tool SpamAssassin is used to collect the keyword triggering results for each mail in the mail data set, and the triggering data is formatted, comprising the following steps:

3.1) disable all rules built into SpamAssassin and disable the Bayesian algorithm to eliminate the influence of other rules, then add the keyword rules generated in step 2);

5. The method for generating Chinese keyword rules for spam filtering according to claim 1, characterized in that: in step 4), a neural network algorithm is used to assign scores to the keyword rules generated in step 2), comprising the following steps:

4.1) first replicate the non-spam mail redundantly; the number of copies of a non-spam mail is given by 1 + (number_of_test_hit) * ham_preference, where ham_preference is an input parameter that defaults to 2.0 and number_of_test_hit is the number of rules the mail triggers;

4.2) randomly assign each rule a weight within a specific range, the range being determined by the number of mails that trigger the rule;

4.3) train with the neural network algorithm and stop after num_epochs iterations, where num_epochs is the number of neural network iterations; the weight_decay and bias parameters are specified for each round of iteration, where weight_decay is the rate at which the weights decay in one iteration and bias is an offset used to smooth statistical anomalies;