CN111553388A

CN111553388A - Junk mail detection method based on online AdaBoost

Info

Publication number: CN111553388A
Application number: CN202010265704.7A
Authority: CN
Inventors: 李静梅; 王洪涛; 茹晨广
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2020-08-18

Abstract

The invention belongs to the technical field of network security and relates in particular to a spam detection method based on online AdaBoost. The invention applies the idea of online learning to AdaBoost when training the strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are costly to train. To address these problems, the invention introduces online learning on the basis of AdaBoost, which improves classification, greatly reduces the cost of training the model, and lets the model adapt to the big-data scenarios and dynamically changing environments of spam detection, so that better generalization performance is obtained.

Description

Junk mail detection method based on online AdaBoost

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a junk mail detection method based on online AdaBoost.

Background

With the development of the information age, communication between people has become more convenient. Email is now a very important communication tool in social interaction, but while it brings efficient information exchange, it also brings a great deal of spam. According to statistics, it is common for a user to receive hundreds of emails each day, and nearly 90% of them are spam, including advertisements for various products and services. Spam forces users to spend time identifying unwanted mail and also wastes storage space and network bandwidth. Spam detection has become one of the great challenges facing the information security field, and machine learning has been widely applied to it. However, traditional spam detection algorithms have many defects: a single machine learning algorithm has low detection accuracy, a batch learning algorithm cannot adjust its model in time in a dynamic environment, and training costs are high. To address these problems, the method uses the AdaBoost algorithm to combine trained weak classifiers into a strong classifier, which improves classification; on this basis it introduces online learning, which reduces training overhead and lets the model adapt to changes in the network in a dynamic environment. The method addresses the unstable classification performance of traditional mail classification methods, works well in a dynamic environment, and reduces training cost. Compared with existing spam detection methods, the proposed method therefore offers higher accuracy, stronger environmental adaptability, higher efficiency, and easier extension.

Disclosure of Invention

The invention aims to provide a spam detection method based on online AdaBoost, which improves the spam detection accuracy and the model training efficiency and is suitable for a dynamic environment.

The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:

step 1: inputting the mail samples to be detected; taking part of the mail sample data to construct a training set, wherein for each mail sample (X, Y), X is the feature set of the mail sample and Y is its label, which marks whether the mail sample is spam; Y is manually labeled in the training set;

step 2: training D weak classifiers using the training set, and initializing the weight counters λ_t^sc and λ_t^sw (t = 1, 2, ..., D) of the weak classifiers, where λ_t^sc and λ_t^sw are the weighted counters of correctly classified and incorrectly classified samples, respectively; the specific process is as follows:

step 2.1: extracting a sample (X, Y) from the training set and inputting it into a weak classifier h_t; initializing the weight λ of the sample (X, Y) to 1; randomly drawing a positive integer k from the Poisson distribution Poisson(λ), the weak classifier h_t then learning the sample k times using a Bernoulli-based multivariate naive Bayes model;
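The Poisson draw in step 2.1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `sample_poisson` and the use of Knuth's multiplication method are assumptions chosen to keep the example within the standard library. Note that a Poisson draw can be 0, in which case the weak classifier would skip the sample.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw k ~ Poisson(lam) with Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    # Keep multiplying uniform draws until the product falls to exp(-lam) or below.
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


# With sample weight lam = 1 the weak classifier sees each sample once on average.
rng = random.Random(0)
draws = [sample_poisson(1.0, rng) for _ in range(10_000)]
mean_k = sum(draws) / len(draws)
```

A larger sample weight λ raises the expected number of repetitions, which is how the weight influences how strongly a weak classifier learns a given sample.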

step 2.2: letting X = (t_1, ..., t_m), where each t_i is a binary variable indicating whether the feature is present in the sample and m is the number of features of the sample (X, Y), and calculating the intermediate conditional probability P(X | Y = C_k):

where C_k represents a category of mail, i.e., normal mail or spam;

step 2.3: calculating the probability P(Y = C_k) that C_k occurs within the training set:

where n(C_i) denotes the frequency with which samples of category C_i occur in the training set;

step 2.4: calculating the probability P(Y = C_k | X) that the mail sample (X, Y) is spam:

the probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities;
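The formulas for steps 2.2 to 2.4 are not reproduced in this text. For a Bernoulli naive Bayes model over the symbols defined in those steps, the standard forms would be as follows; they are given as an assumption about the omitted formulas, with P(t_i = 1 | Y = C_k) denoting the probability that feature i is present in a mail of category C_k.

```latex
% Step 2.2: class-conditional likelihood under the Bernoulli model
P(X \mid Y = C_k) = \prod_{i=1}^{m}
  P(t_i = 1 \mid Y = C_k)^{t_i}\,
  \bigl(1 - P(t_i = 1 \mid Y = C_k)\bigr)^{1 - t_i}

% Step 2.3: class prior from training-set frequencies
P(Y = C_k) = \frac{n(C_k)}{\sum_{i} n(C_i)}

% Step 2.4: posterior by Bayes' rule
P(Y = C_k \mid X) =
  \frac{P(X \mid Y = C_k)\, P(Y = C_k)}
       {\sum_{j} P(X \mid Y = C_j)\, P(Y = C_j)}
```

Because the denominator in the last line is the same for both categories, comparing the two posteriors amounts to comparing the two numerators.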

step 2.5: comparing the predicted result with the actual label of the sample;

if the weak classifier h_t classifies the sample correctly, i.e. h_t(X) = sign(Y), calculating λ_t^sc ← λ_t^sc + λ to update the correct-classification weighted counter, where λ is the sample weight; calculating

to update the approximate weighted misclassification rate ε_t; and calculating:

to update the weight of the sample (X, Y);

if the weak classifier h_t misclassifies the sample, calculating λ_t^sw ← λ_t^sw + λ to update the misclassification weighted counter, and calculating in the same way

to update the approximate weighted misclassification rate; and calculating

to update the weight of the sample (X, Y);
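The update formulas of step 2.5 are not reproduced in this text. The counters λ_t^sc and λ_t^sw and the Poisson draw match the online boosting scheme of Oza and Russell, so a plausible reading, stated here as an assumption rather than as the patent's own formulas, is the following, where ε_t is the approximate weighted misclassification rate of h_t:

```latex
% Approximate weighted misclassification rate of weak classifier h_t
\varepsilon_t \leftarrow \frac{\lambda_t^{sw}}{\lambda_t^{sc} + \lambda_t^{sw}}

% Sample weight passed on to the next weak classifier
\lambda \leftarrow
\begin{cases}
  \dfrac{\lambda}{2\,(1 - \varepsilon_t)} & \text{if } h_t(X) = \operatorname{sign}(Y) \\[1ex]
  \dfrac{\lambda}{2\,\varepsilon_t}       & \text{otherwise}
\end{cases}
```

Under this reading, whenever ε_t < 1/2 a correctly classified sample is passed on with a smaller weight and a misclassified one with a larger weight, so later weak classifiers concentrate on the samples that earlier ones get wrong.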

step 2.6: calculating the weight α_t of the weak classifier h_t, which completes the update of the weak classifier h_t;
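The formula for α_t is not reproduced in this text. Oza and Russell's online boosting weights a weak classifier by the log-odds of its accuracy, and batch AdaBoost uses half of that value; assuming the method follows the online form, with ε_t the approximate weighted misclassification rate of h_t:

```latex
\alpha_t = \ln\frac{1 - \varepsilon_t}{\varepsilon_t}
```

A constant factor such as 1/2 changes neither the ranking of the weak classifiers by weight nor the sign of a weighted vote, so either convention leads to the same selections and predictions.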

step 2.7: inputting the updated sample into the next weak classifier and repeating steps 2.2 to 2.6 until all weak classifiers have been updated, which completes one cycle, and selecting the weak classifier with the highest weight;

step 2.8: judging whether all mail samples in the training set have been used for training; if not, returning to step 2.1; if all mail samples in the training set have been used, integrating all the selected weak classifiers into the strong classifier H(X);

step 3: inputting the remaining mail samples to be detected into the strong classifier H(X) to complete the spam detection.
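The expression for H(X) is not written out in this text. The usual AdaBoost combination is a weighted vote; the form below assumes labels of +1 for spam and -1 for normal mail, and S is a hypothetical name for the index set of the weak classifiers selected during training.

```latex
H(X) = \operatorname{sign}\Bigl(\sum_{t \in S} \alpha_t\, h_t(X)\Bigr)
```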

The invention has the beneficial effects that:

The invention provides a spam detection method based on online AdaBoost, which applies the idea of online learning to AdaBoost when training the strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are costly to train. To address these problems, the invention introduces online learning on the basis of AdaBoost, which improves classification, greatly reduces the cost of training the model, and lets the model adapt to the big-data scenarios and dynamically changing environments of spam detection, so that better generalization performance is obtained.

Drawings

Fig. 1 is a diagram of the steps of online AdaBoost training and picking weak classifiers.

Fig. 2 is a process diagram of combining strong classifiers.

Fig. 3 is a flow chart of the method implementation and application of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The invention provides a spam detection method based on online AdaBoost, which applies the idea of online learning to AdaBoost when training the strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are costly to train. To address these problems, the method introduces online learning on the basis of AdaBoost, which improves classification, greatly reduces the cost of training the model, and lets the model adapt to the big-data scenarios and dynamically changing environments of spam detection, so that better generalization performance is obtained.

A junk mail detection method based on online AdaBoost comprises the following steps:

OAdaboostNBC is built with a naive Bayes classifier as the base classifier and incorporates the idea of online learning. A plurality of weak classifiers are trained using online AdaBoost and the naive Bayes algorithm, and the best-performing weak classifiers among them are selected to form a strong classifier that classifies mail samples.

The steps of training a strong classifier by using an online AdaBoost and naive Bayes algorithm are as follows:

(1) A plurality of weak classifiers are trained using part of the mail sample data set, and the weighted counters of correct and incorrect results are initialized for each weak classifier.

(2) For each newly input sample, a weak classifier is trained with a Bernoulli-based multivariate naive Bayes model; its weighted counters of correct and incorrect classifications are updated according to its classification of the new sample; and the weight of the current sample and the weight of the weak classifier are calculated and updated. The sample with its updated weight is then input into the other weak classifiers in turn until all weak classifiers have been updated, and the weak classifier with the highest weight is selected from among them, which completes one cycle.

The weak classifiers selected in each cycle are combined into a strong classifier.

The invention provides an efficient spam detection method based on online AdaBoost that improves spam detection accuracy and model training efficiency and adapts to a dynamic environment. The method comprises the following 5 steps:

1. Train a plurality of weak classifiers using part of the mail sample data set, and initialize the weight counters λ_t^sc and λ_t^sw (t = 1, 2, ..., D) of the weak classifiers, where λ_t^sc and λ_t^sw are the weighted counters of correctly classified and incorrectly classified samples, respectively.

2. Define h_t as the weak classifier into which a new sample (X, Y) is first input, where X is the feature set of the mail sample and Y is its label, marking whether the mail sample is spam. Initialize the weight λ of the sample (X, Y) to 1 and randomly draw a positive integer k from the Poisson distribution Poisson(λ); the weak classifier h_t then learns the sample (X, Y) k times using a Bernoulli-based multivariate naive Bayes model. The specific learning process is as follows:

(1) Let X = (t_1, ..., t_m), where each t_i is a binary variable indicating whether the feature is present in the sample (0 indicates that feature t_i is not present in the sample, 1 indicates that it is present), and m is the number of features of the sample (X, Y).

Formula (6) is calculated:

The resulting P(X | Y = C_k) is the intermediate conditional probability used in solving for the spam probability, where C_k represents the category of mail (normal mail or spam). Formula (7) is calculated:

The resulting P(Y = C_k) is the probability that C_k occurs within the training set, where n(C_i) denotes the frequency with which samples of category C_i occur in the training set. The Bayes formula (8) is calculated:

The resulting P(Y = C_k | X) is the probability that the mail sample (X, Y) is spam; the probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities.

3. Compare the predicted result with the actual label. If the weak classifier h_t classifies the sample correctly, i.e. h_t(X) = sign(Y), formula (1) is calculated:

λ_t^sc ← λ_t^sc + λ (1)

to update the correct-classification weighted counter, where λ is the sample weight and ε_t is the weighted misclassification rate. Formula (2) is calculated:

to update the approximate weighted misclassification rate, and formula (3) is calculated:

to update the weight of the sample (X, Y).

If the weak classifier h_t misclassifies the sample, λ_t^sw ← λ_t^sw + λ is calculated to update the misclassification weighted counter, and formula (2) is calculated in the same way to update the approximate weighted misclassification rate. The formula

is then calculated to update the weight of the sample (X, Y).

4. Formula (4) is calculated:

to complete the update of the weak classifier h_t, where α_t is the weight of the weak classifier h_t.
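Steps 3 and 4 can be sketched as a single update function. Formulas (2) to (4) are not reproduced in this text, so the expressions below are assumptions taken from Oza and Russell's online boosting, which uses the same counters; the function name `update_weak_classifier` and the clamping constant are hypothetical.

```python
import math


def update_weak_classifier(correct, lam, lam_sc, lam_sw):
    """Update one weak classifier's counters after it has predicted a sample.

    Returns (new sample weight, lam_sc, lam_sw, eps, alpha).
    """
    # Formula (1) or its counterpart: add the sample weight to the matching counter.
    if correct:
        lam_sc += lam
    else:
        lam_sw += lam
    # Assumed formula (2): approximate weighted misclassification rate.
    eps = lam_sw / (lam_sc + lam_sw)
    # Clamp to avoid division by zero and log(0) while a counter is still empty.
    eps_safe = min(max(eps, 1e-10), 1.0 - 1e-10)
    # Assumed formula (3): shrink the weight after a hit, grow it after a miss.
    if correct:
        lam = lam / (2.0 * (1.0 - eps_safe))
    else:
        lam = lam / (2.0 * eps_safe)
    # Assumed formula (4): weight of the weak classifier.
    alpha = math.log((1.0 - eps_safe) / eps_safe)
    return lam, lam_sc, lam_sw, eps, alpha


hit = update_weak_classifier(True, 1.0, 3.0, 1.0)
miss = update_weak_classifier(False, 1.0, 3.0, 1.0)
```

In the example, a correct prediction lowers the sample weight from 1 to 0.625, while a misclassification raises it to 1.25, so the next weak classifier in the cycle gives more attention to the sample that was missed.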

5. Input the updated sample into the next weak classifier and continue the update steps for the sample and the weak classifier until all weak classifiers have been updated, which completes one cycle. One cycle is performed for every newly input sample, and the weak classifier with the highest weight is selected in each cycle. After the cycles are finished, that is, after all mail samples have been used for training, all the selected weak classifiers are integrated into a strong classifier, as in formula (5):

where H is the final strong classifier after training.
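The per-cycle selection and the final combination in step 5 can be sketched as follows. Formula (5) is not reproduced in this text, so the weighted vote below is the usual AdaBoost form and is an assumption; `select_best`, `strong_classify`, the stub weak classifiers, and the weight values are all hypothetical and stand in for trained models.

```python
def select_best(alphas):
    """Return the index of the weak classifier with the highest weight in one cycle."""
    return max(range(len(alphas)), key=lambda t: alphas[t])


def strong_classify(x, selected):
    """Weighted vote over (alpha, weak_classifier) pairs: +1 is spam, -1 is normal mail."""
    score = sum(alpha * h(x) for alpha, h in selected)
    return 1 if score >= 0 else -1


# Hypothetical stand-ins for trained weak classifiers over two binary features.
weak = [
    lambda x: 1 if x[0] else -1,
    lambda x: 1 if x[1] else -1,
    lambda x: -1,
]
# Hypothetical classifier weights observed at the end of three cycles.
alphas_per_cycle = [[1.2, 0.4, 0.5], [0.3, 0.9, 0.1], [0.2, 0.1, 0.6]]

selected = []
for alphas in alphas_per_cycle:
    best = select_best(alphas)
    selected.append((alphas[best], weak[best]))

label_a = strong_classify((1, 1), selected)
label_b = strong_classify((0, 1), selected)
```

In a real run the weak classifiers keep changing between cycles, so an implementation would need to decide whether to store a snapshot of the selected classifier or a reference to the live one; the source does not say which.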

After the above 5 steps, a spam detection method based on online AdaBoost is formed. The method enhances the classification stability of the junk mails, reduces the training overhead, and can better adapt to the dynamically changing environment.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A spam detection method based on online AdaBoost is characterized by comprising the following steps:

step 2.1: extracting a sample (X, Y) from the training set and inputting it into a weak classifier h_t; initializing the weight λ of the sample (X, Y) to 1; randomly drawing a positive integer k from the Poisson distribution Poisson(λ), the weak classifier h_t then learning the sample k times using a Bernoulli-based multivariate naive Bayes model;

where C_k represents a category of mail, i.e., normal mail or spam;

step 2.5: comparing the predicted result with the actual result of the sample;

updating the weights of the samples (X, Y);

Updating the approximate weighted misclassification rate; computing

Updating the weights of the samples (X, Y);