CN111553388A - Junk mail detection method based on online AdaBoost - Google Patents
- Publication number: CN111553388A
- Application number: CN202010265704.7A
- Authority: CN (China)
- Prior art keywords: sample, samples, training, weak classifier
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/42—Mailbox-related aspects, e.g. synchronisation of mailboxes
Abstract
The invention belongs to the technical field of network security and relates in particular to a spam detection method based on online AdaBoost. It applies the idea of online learning to AdaBoost in order to train a strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are expensive to train. To address these problems, the invention introduces online learning on top of AdaBoost. This improves classification, greatly reduces the cost of training the model, and lets the model adapt to the large data volumes and dynamically changing conditions of spam detection, so that it generalizes better.
Description
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a junk mail detection method based on online AdaBoost.
Background
With the development of the information age, communication between people has become more convenient. Email is now a very important communication tool, but along with effective communication it also brings a great deal of spam. According to statistics, it is common for a user to receive hundreds of emails each day, and nearly 90% of them are spam, including advertisements for various products and services. Spam forces the user to pick out unwanted mail, which costs time, and it also wastes storage space and network bandwidth. Spam detection has become one of the major challenges in information security, and machine learning is widely applied to it. Traditional spam detection algorithms nevertheless have several defects: a single machine learning algorithm has low detection accuracy, a batch learning algorithm cannot adjust the model in time in a dynamic environment, and training is expensive. To address these problems, the method uses the AdaBoost algorithm to combine trained weak classifiers into a strong classifier, which improves classification. On this basis it introduces online learning, which reduces the training overhead and lets the model adapt to changes in the network in a dynamic environment. The method thus addresses the unstable classification performance of traditional mail classification methods, works well in a dynamic environment, and reduces the training cost. Compared with existing spam detection methods, the method of the invention offers higher accuracy, stronger adaptability to the environment, higher efficiency, and easier extension.
Disclosure of Invention
The invention aims to provide a spam detection method based on online AdaBoost, which improves the spam detection accuracy and the model training efficiency and is suitable for a dynamic environment.
The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:
Step 1: Input the mail samples to be detected. Take part of the mail sample data to construct a training set. For each mail sample (X, Y), X is the feature set of the sample and Y is its label, which marks whether the sample is spam; Y is labeled manually in the training set.
Step 2: Train D weak classifiers on the training set and initialize the weighted counters $\lambda_t^{sc}$ and $\lambda_t^{sw}$ of each weak classifier, $t = 1, 2, \ldots, D$, where $\lambda_t^{sc}$ and $\lambda_t^{sw}$ are the counters of correctly classified and misclassified samples, respectively. The specific process is as follows:
Step 2.1: Draw a sample (X, Y) from the training set and input it into weak classifier $h_t$. Initialize the weight of the sample (X, Y) to $\lambda = 1$. Draw a positive integer k at random from the Poisson distribution $\mathrm{Poisson}(\lambda)$; weak classifier $h_t$ then learns the sample k times using a multivariate Bernoulli naive Bayes model.
Step 2.2: Let $X = (t_1, \ldots, t_m)$, where each $t_i$ is a binary variable that indicates whether the feature is present in the sample, and m is the number of features of the sample (X, Y). Compute the intermediate conditional probability $P(X \mid Y = C_k)$,
where $C_k$ denotes a category of mail, i.e. normal mail or spam.
Step 2.3: Compute the probability $P(Y = C_k)$ that category $C_k$ occurs in the training set,
where $n(C_i)$ is the frequency of occurrence of samples of category $C_i$ in the training set.
Step 2.4: Compute the probability $P(Y = C_k \mid X)$ that the mail sample (X, Y) is spam.
The probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities.
Step 2.5: Compare the predicted result with the actual label of the sample.
If weak classifier $h_t$ classifies the sample correctly, i.e. $h_t(X) = \mathrm{sign}(Y)$: compute $\lambda_t^{sc} \leftarrow \lambda_t^{sc} + \lambda$ to update the correct-classification weighted counter, where $\lambda$ is the sample weight; update the approximate weighted misclassification rate $\varepsilon_t$; then update the weight of the sample (X, Y).
If weak classifier $h_t$ misclassifies the sample: compute $\lambda_t^{sw} \leftarrow \lambda_t^{sw} + \lambda$ to update the misclassification weighted counter; update the approximate weighted misclassification rate $\varepsilon_t$ in the same way; then update the weight of the sample (X, Y).
Step 2.6: Compute the weight $\alpha_t$ of weak classifier $h_t$, which completes the update of $h_t$.
Step 2.7: Input the updated sample into the next weak classifier and repeat steps 2.2 to 2.6 until all weak classifiers have been updated. This completes one cycle, after which the weak classifier with the highest weight is selected.
Step 2.8: Check whether all mail samples in the training set have been trained. If not, return to step 2.1. If so, combine all selected weak classifiers into the strong classifier H(X).
Step 3: Input the remaining mail samples to be detected into the strong classifier H(X) to complete the spam detection.
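The formulas that steps 2.2 to 2.6 refer to are not reproduced in this text. As a reading aid only, the block below lists the textbook forms that match the quantities named in those steps: the multivariate Bernoulli naive Bayes model for steps 2.2 to 2.4, and the online boosting updates of Oza and Russell for steps 2.5 to 2.8. These are assumptions, not the patent's own formulas, which may differ in detail; the symbols $p_{ki}$ and $\varepsilon_t$ are notation introduced here.

```latex
\begin{align*}
P(X \mid Y = C_k) &= \prod_{i=1}^{m} p_{ki}^{\,t_i}\,(1 - p_{ki})^{1 - t_i},
  \qquad p_{ki} = P(t_i = 1 \mid Y = C_k) \\
P(Y = C_k) &= \frac{n(C_k)}{\sum_i n(C_i)} \\
P(Y = C_k \mid X) &= \frac{P(X \mid Y = C_k)\,P(Y = C_k)}
                          {\sum_j P(X \mid Y = C_j)\,P(Y = C_j)} \\
\varepsilon_t &= \frac{\lambda_t^{sw}}{\lambda_t^{sc} + \lambda_t^{sw}} \\
\lambda &\leftarrow
  \begin{cases}
    \dfrac{\lambda}{2\,(1 - \varepsilon_t)} & \text{if } h_t(X) = \operatorname{sign}(Y) \\[1ex]
    \dfrac{\lambda}{2\,\varepsilon_t} & \text{otherwise}
  \end{cases} \\
\alpha_t &= \log\frac{1 - \varepsilon_t}{\varepsilon_t} \\
H(X) &= \operatorname{sign}\Big(\sum_t \alpha_t\, h_t(X)\Big)
\end{align*}
```

Under these forms a correctly classified sample has its weight reduced whenever $\varepsilon_t < 1/2$ and a misclassified sample has its weight increased, so later weak classifiers in the same cycle concentrate on the samples that earlier ones got wrong.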
The beneficial effects of the invention are as follows:
The invention provides a spam detection method based on online AdaBoost, which applies the idea of online learning to AdaBoost in order to train a strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are expensive to train. To address these problems, the invention introduces online learning on top of AdaBoost. This improves classification, greatly reduces the cost of training the model, and lets the model adapt to the large data volumes and dynamically changing conditions of spam detection, so that it generalizes better.
Drawings
Fig. 1 is a diagram of the steps of online AdaBoost training and picking weak classifiers.
Fig. 2 is a diagram of the process of combining the weak classifiers into a strong classifier.
Fig. 3 is a flow chart of the method implementation and application of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The spam detection method based on online AdaBoost works as follows.
The classifier, OAdaboostNBC, is built with a naive Bayes classifier as the base classifier and incorporates the idea of online learning. Multiple weak classifiers are trained with online AdaBoost and the naive Bayes algorithm, and the best-performing weak classifiers among them are selected and combined into a strong classifier that classifies mail samples.
The strong classifier is trained with online AdaBoost and the naive Bayes algorithm as follows:
(1) Train multiple weak classifiers on part of the mail sample data set, and initialize each weak classifier's weighted counters for correct and incorrect results.
(2) For each newly input sample, train the weak classifier with a multivariate Bernoulli naive Bayes model, update the weak classifier's weighted counters for correct and incorrect classification according to its result on the new sample, and compute and update the weight of the current sample and the weight of the weak classifier. Then pass the sample with its updated weight to the other weak classifiers in turn until all weak classifiers have been updated, and select the weak classifier with the highest weight among them. This completes one cycle.
The weak classifiers selected in each cycle are then combined into a strong classifier.
The invention provides an efficient spam detection method based on online AdaBoost that improves spam detection accuracy and model training efficiency and adapts to a dynamic environment. The method comprises the following 5 steps:
1. Train multiple weak classifiers on part of the mail sample data set and initialize the weighted counters $\lambda_t^{sc}$ and $\lambda_t^{sw}$ of the weak classifiers ($t = 1, 2, \ldots, D$), where $\lambda_t^{sc}$ and $\lambda_t^{sw}$ are the counters of correctly classified and misclassified samples, respectively.
2. Let $h_t$ be the first weak classifier that receives the new sample (X, Y), where X is the feature set of the mail sample and Y is its label, which marks whether the sample is spam. Initialize the weight of the sample (X, Y) to $\lambda = 1$ and draw a positive integer k at random from the Poisson distribution $\mathrm{Poisson}(\lambda)$; weak classifier $h_t$ then learns the sample (X, Y) k times using a multivariate Bernoulli naive Bayes model. The specific learning process is as follows:
(1) Let $X = (t_1, \ldots, t_m)$, where each $t_i$ is a binary variable that indicates whether the feature is present in the sample (0 means feature $t_i$ is absent from the sample, 1 means it is present), and m is the number of features of the sample (X, Y).
Formula (6) gives $P(X \mid Y = C_k)$, the intermediate conditional probability used in computing the spam probability, where $C_k$ denotes the category of mail (normal mail or spam).
Formula (7) gives $P(Y = C_k)$, the probability that category $C_k$ occurs in the training set, where $n(C_i)$ is the frequency of occurrence of samples of category $C_i$ in the training set.
Bayes' formula (8) gives $P(Y = C_k \mid X)$, the probability that the mail sample (X, Y) is spam. The probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities.
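A minimal Python sketch of this learning step, under stated assumptions: the class `OnlineBernoulliNB`, the helper `poisson`, the label encoding (+1 for spam, -1 for normal mail) and the Laplace smoothing are all illustrative choices that do not come from the patent. The sketch keeps per-class counts so that a sample can be learned k times incrementally, and it compares log joint probabilities, which is equivalent to comparing the two posteriors because the evidence P(X) is the same for both classes. Note that a Poisson draw can return 0, in which case the weak classifier simply skips the sample.

```python
import math
import random


class OnlineBernoulliNB:
    """Incrementally trained Bernoulli naive Bayes over binary features (illustrative)."""

    def __init__(self, n_features, classes=(-1, 1), smoothing=1.0):
        self.classes = classes
        self.smoothing = smoothing  # Laplace smoothing: an assumption, not from the patent
        self.class_count = {c: 0 for c in classes}                   # n(C_k)
        self.feature_count = {c: [0] * n_features for c in classes}  # times t_i = 1 in class

    def learn(self, x, y, k=1):
        """Present the sample (x, y) to the model k times by adding k to its counts."""
        self.class_count[y] += k
        counts = self.feature_count[y]
        for i, t in enumerate(x):
            if t:
                counts[i] += k

    def log_joint(self, x, c):
        """Return log P(Y=c) + sum_i log P(t_i | Y=c) using smoothed count estimates."""
        s = self.smoothing
        total = sum(self.class_count.values())
        prior = (self.class_count[c] + s) / (total + s * len(self.classes))
        score = math.log(prior)
        for i, t in enumerate(x):
            p = (self.feature_count[c][i] + s) / (self.class_count[c] + 2 * s)
            score += math.log(p if t else 1.0 - p)
        return score

    def predict(self, x):
        """Return the class with the larger posterior; P(X) cancels in the comparison."""
        return max(self.classes, key=lambda c: self.log_joint(x, c))


def poisson(lam, rng=random):
    """Draw k ~ Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k
```

A weak classifier would then process a sample with weight `lam` as `nb.learn(x, y, k=poisson(lam))`, so samples with larger weights are, on average, presented more often.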
3. Compare the predicted result with the actual result. If weak classifier $h_t$ classifies the sample correctly, i.e. $h_t(X) = \mathrm{sign}(Y)$, compute formula (1):
$\lambda_t^{sc} \leftarrow \lambda_t^{sc} + \lambda$ (1)
to update the correct-classification weighted counter, where $\lambda$ is the sample weight. With $\varepsilon_t$ denoting the weighted misclassification rate, formula (2) then updates the approximate weighted misclassification rate, and formula (3) updates the weight of the sample (X, Y).
If weak classifier $h_t$ misclassifies the sample, compute $\lambda_t^{sw} \leftarrow \lambda_t^{sw} + \lambda$ to update the misclassification weighted counter, compute formula (2) in the same way to update the approximate weighted misclassification rate, and then update the weight of the sample (X, Y) with the corresponding formula.
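The bookkeeping in this step can be sketched as follows. Formulas (2) and (3) are not reproduced in this text, so the sketch assumes the standard online boosting updates of Oza and Russell: the error rate is the misclassified share of the accumulated weight, and the sample weight is divided by 2(1 - ε) after a correct prediction or by 2ε after a wrong one. The function name and argument order are hypothetical.

```python
def update_counters_and_weight(correct, lam, lam_sc, lam_sw):
    """Update one weak classifier's counters and the sample weight.

    correct: True if the weak classifier predicted the sample's label.
    lam:     current weight of the sample (must be > 0).
    lam_sc:  weighted counter of correctly classified samples.
    lam_sw:  weighted counter of misclassified samples.
    Returns (lam_sc, lam_sw, eps, new_lam).
    """
    # Formula (1) or its misclassification counterpart.
    if correct:
        lam_sc += lam
    else:
        lam_sw += lam

    # Assumed form of formula (2): approximate weighted misclassification rate.
    eps = lam_sw / (lam_sc + lam_sw)

    # Assumed form of formula (3). The divisor is never zero: after a correct
    # prediction lam_sc > 0, so eps < 1; after a wrong one lam_sw > 0, so eps > 0.
    if correct:
        new_lam = lam / (2.0 * (1.0 - eps))
    else:
        new_lam = lam / (2.0 * eps)
    return lam_sc, lam_sw, eps, new_lam
```

For example, with counters 3 (correct) and 1 (wrong) and a sample of weight 1, a correct prediction gives ε = 0.2 and lowers the weight to 0.625, while a wrong prediction gives ε = 0.4 and raises the weight to 1.25.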
4. Compute formula (4) to complete the update of weak classifier $h_t$, where $\alpha_t$ is the weight of weak classifier $h_t$.
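Formula (4) is not reproduced in this text. A common choice for the classifier weight in AdaBoost-style methods, assumed here for illustration, is the log-odds of the weighted error rate, α = log((1 - ε) / ε). The function name and the clamping constant `floor` are hypothetical; the clamp only guards against division by zero when a weak classifier has made no errors or only errors so far.

```python
import math


def classifier_weight(eps, floor=1e-10):
    """Return alpha = log((1 - eps) / eps) for a weighted error rate eps in [0, 1]."""
    # Clamp eps away from 0 and 1 so the logarithm stays finite (an assumption).
    eps = min(max(eps, floor), 1.0 - floor)
    return math.log((1.0 - eps) / eps)
```

With this form a weak classifier that is better than chance (ε < 0.5) gets a positive weight, one at chance level gets weight 0, and one worse than chance gets a negative weight.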
5. Input the updated sample into the next weak classifier and continue to execute the update steps for the sample and the weak classifier until all weak classifiers have been updated, which completes one cycle. One cycle is run for every newly input sample, and the weak classifier with the highest weight is selected in each cycle. When the cycles are finished, i.e. all mail samples have been trained, all selected weak classifiers are combined into the strong classifier by formula (5),
where H is the final strong classifier after training.
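Formula (5) is not reproduced in this text. The sketch below assumes the usual AdaBoost combination, a weighted vote H(X) = sign(Σ α_t h_t(X)), with labels +1 for spam and -1 for normal mail. The function name, the `(alpha, predict_fn)` pair representation and the tie-breaking rule are illustrative assumptions.

```python
def strong_classify(x, selected):
    """Weighted vote of the selected weak classifiers.

    selected: list of (alpha, predict_fn) pairs, where predict_fn(x) returns
              +1 (spam) or -1 (normal mail) and alpha is that classifier's weight.
    """
    score = sum(alpha * predict_fn(x) for alpha, predict_fn in selected)
    # Ties are resolved as spam here; this choice is an assumption.
    return 1 if score >= 0 else -1


def always_spam(x):
    """Toy weak classifier used only in the usage example."""
    return 1


def always_normal(x):
    """Toy weak classifier used only in the usage example."""
    return -1
```

For instance, `strong_classify([1, 0], [(2.0, always_spam), (0.5, always_normal), (0.5, always_normal)])` returns 1, because the single high-weight classifier outvotes the two low-weight ones.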
These 5 steps constitute the spam detection method based on online AdaBoost. The method makes spam classification more stable, reduces the training overhead, and adapts better to a dynamically changing environment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (1)
1. A spam detection method based on online AdaBoost, characterized by comprising the following steps:
Step 1: Input the mail samples to be detected. Take part of the mail sample data to construct a training set. For each mail sample (X, Y), X is the feature set of the sample and Y is its label, which marks whether the sample is spam; Y is labeled manually in the training set.
Step 2: Train D weak classifiers on the training set and initialize the weighted counters $\lambda_t^{sc}$ and $\lambda_t^{sw}$ of each weak classifier, $t = 1, 2, \ldots, D$, where $\lambda_t^{sc}$ and $\lambda_t^{sw}$ are the counters of correctly classified and misclassified samples, respectively. The specific process is as follows:
Step 2.1: Draw a sample (X, Y) from the training set and input it into weak classifier $h_t$. Initialize the weight of the sample (X, Y) to $\lambda = 1$. Draw a positive integer k at random from the Poisson distribution $\mathrm{Poisson}(\lambda)$; weak classifier $h_t$ then learns the sample k times using a multivariate Bernoulli naive Bayes model.
Step 2.2: Let $X = (t_1, \ldots, t_m)$, where each $t_i$ is a binary variable that indicates whether the feature is present in the sample, and m is the number of features of the sample (X, Y). Compute the intermediate conditional probability $P(X \mid Y = C_k)$,
where $C_k$ denotes a category of mail, i.e. normal mail or spam.
Step 2.3: Compute the probability $P(Y = C_k)$ that category $C_k$ occurs in the training set,
where $n(C_i)$ is the frequency of occurrence of samples of category $C_i$ in the training set.
Step 2.4: Compute the probability $P(Y = C_k \mid X)$ that the mail sample (X, Y) is spam.
The probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities.
Step 2.5: Compare the predicted result with the actual label of the sample.
If weak classifier $h_t$ classifies the sample correctly, i.e. $h_t(X) = \mathrm{sign}(Y)$: compute $\lambda_t^{sc} \leftarrow \lambda_t^{sc} + \lambda$ to update the correct-classification weighted counter, where $\lambda$ is the sample weight; update the approximate weighted misclassification rate $\varepsilon_t$; then update the weight of the sample (X, Y).
If weak classifier $h_t$ misclassifies the sample: compute $\lambda_t^{sw} \leftarrow \lambda_t^{sw} + \lambda$ to update the misclassification weighted counter; update the approximate weighted misclassification rate $\varepsilon_t$ in the same way; then update the weight of the sample (X, Y).
Step 2.6: Compute the weight $\alpha_t$ of weak classifier $h_t$, which completes the update of $h_t$.
Step 2.7: Input the updated sample into the next weak classifier and repeat steps 2.2 to 2.6 until all weak classifiers have been updated. This completes one cycle, after which the weak classifier with the highest weight is selected.
Step 2.8: Check whether all mail samples in the training set have been trained. If not, return to step 2.1. If so, combine all selected weak classifiers into the strong classifier H(X).
Step 3: Input the remaining mail samples to be detected into the strong classifier H(X) to complete the spam detection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010265704.7A CN111553388A (en) | 2020-04-07 | 2020-04-07 | Junk mail detection method based on online AdaBoost |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111553388A (en) | 2020-08-18 |
Family
ID=72004269
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202010265704.7A | Pending | CN111553388A (en) | 2020-04-07 | 2020-04-07 | Junk mail detection method based on online AdaBoost |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111553388A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101221623A (en) * | 2008-01-30 | 2008-07-16 | 北京中星微电子有限公司 | Object type on-line training and recognizing method and system thereof |
CN103593672A (en) * | 2013-05-27 | 2014-02-19 | 深圳市智美达科技有限公司 | Adaboost classifier on-line learning method and Adaboost classifier on-line learning system |
CN107133597A (en) * | 2017-05-11 | 2017-09-05 | 南宁市正祥科技有限公司 | A kind of front vehicles detection method in the daytime |
CN108537279A (en) * | 2018-04-11 | 2018-09-14 | 中南大学 | Based on the data source grader construction method for improving Adaboost algorithm |
US20180293381A1 (en) * | 2017-04-07 | 2018-10-11 | Trustpath Inc. | System and method for malware detection on a per packet basis |
CN108804651A (en) * | 2018-06-07 | 2018-11-13 | 南京邮电大学 | A kind of Social behaviors detection method based on reinforcing Bayes's classification |
CN110149268A (en) * | 2019-05-15 | 2019-08-20 | 深圳市趣创科技有限公司 | A kind of method and its system of automatic fitration spam |
CN110737805A (en) * | 2019-10-18 | 2020-01-31 | 网易(杭州)网络有限公司 | Method and device for processing graph model data and terminal equipment |
Non-Patent Citations (3)
Title |
---|
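Weiming Hu et al.: "Online Adaboost-Based Parameterized Methods for Dynamic Distributed Network Intrusion Detection", IEEE Transactions on Cybernetics * |
Li Ru (李茹): "Research and Implementation of a Multi-Level Mail Filtering System Based on Minimum-Risk Bayes", China Master's Theses Full-Text Database, Information Science and Technology * |
Zhai Junchang (翟军昌): "Personalized Spam Filtering Based on the Naive Bayes Algorithm", Journal of Changchun Normal University (Natural Science) * |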
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200818 |