CN111553388A - Junk mail detection method based on online AdaBoost - Google Patents

Junk mail detection method based on online AdaBoost

Info

Publication number
CN111553388A
Authority
CN
China
Prior art keywords
sample
mail
samples
training
weak classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010265704.7A
Other languages
Chinese (zh)
Inventor
李静梅
王洪涛
茹晨广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202010265704.7A priority Critical patent/CN111553388A/en
Publication of CN111553388A publication Critical patent/CN111553388A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes

Abstract

The invention belongs to the technical field of network security, and particularly relates to a junk mail detection method based on online AdaBoost. The invention applies the idea of online learning to AdaBoost to train a strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are expensive to train. To address these problems, the invention introduces the idea of online learning on the basis of AdaBoost, which improves the classification effect, greatly reduces the overhead of training the model, and allows the model to adapt to big data scenarios and dynamically changing environments in spam detection, thereby obtaining better generalization performance.

Description

Junk mail detection method based on online AdaBoost
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a junk mail detection method based on online AdaBoost.
Background
With the development of the information age, communication between people has become more convenient. Email has become a very important communication tool in social interaction, but while it enables effective communication, it also brings a great deal of spam. According to statistics, it is common for a user to receive hundreds of emails each day, and nearly 90% of them are spam, including advertisements for various products and services. Spam forces users to spend time identifying unwanted mail and also wastes storage space and network bandwidth. Spam detection has therefore become one of the major challenges in the information security field, and machine learning has been widely applied to spam detection and related tasks. However, traditional spam detection algorithms have many shortcomings: a single machine learning algorithm has low detection accuracy, a batch learning algorithm cannot adjust its model in time in a dynamic environment, and training is expensive. To address these problems, the method uses the AdaBoost algorithm to combine trained weak classifiers into a strong classifier, which improves the classification effect; on this basis, it introduces the idea of online learning, which reduces training overhead and allows the model to adapt to changes in the network in a dynamic environment. The method thereby addresses the unstable classification performance of traditional mail classification methods, works well in a dynamic environment, and reduces training cost. Compared with existing junk mail detection methods, the method of the invention therefore offers higher accuracy, stronger environmental adaptability, higher efficiency, and easier extension.
Disclosure of Invention
The invention aims to provide a spam detection method based on online AdaBoost, which improves the spam detection accuracy and the model training efficiency and is suitable for a dynamic environment.
The object of the invention is achieved by the following technical solution, which comprises the following steps:
step 1: inputting the mail samples to be detected; taking part of the mail sample data to construct a training set, wherein for each mail sample (X, Y), X is the feature set of the mail sample and Y is the label of the mail sample, which marks whether the mail sample is junk mail; Y is manually labeled in the training set;
step 2: training D weak classifiers using the training set, and initializing the weight counters $\lambda_t^{sc}$ and $\lambda_t^{sw}$ of the weak classifiers, $t = 1, 2, \ldots, D$, where $\lambda_t^{sc}$ and $\lambda_t^{sw}$ are the weighted counters of correctly classified samples and misclassified samples, respectively; the specific process is as follows:
step 2.1: drawing a sample (X, Y) from the training set and inputting it into weak classifier $h_t$; initializing the weight of the sample (X, Y) to $\lambda = 1$; randomly drawing a positive integer k from the Poisson distribution $\mathrm{Poisson}(\lambda)$, the weak classifier $h_t$ learning the sample k times using a multivariate Bernoulli naive Bayes model;
step 2.2: letting $X = (t_1, \ldots, t_m)$, wherein each $t_i$ is a binary variable that indicates whether the feature is present in the sample and m is the number of features of the sample (X, Y), and calculating the intermediate conditional probability $P(X \mid Y = C_k)$:
Figure BDA0002440538330000021
wherein $C_k$ represents a category of mail, i.e., normal mail or spam;
step 2.3: calculating the probability $P(Y = C_k)$ that $C_k$ occurs within the training set:
Figure BDA0002440538330000022
wherein $n(C_i)$ represents the frequency with which samples of category $C_i$ occur in the training set;
step 2.4: calculating the probability $P(Y = C_k \mid X)$ that the mail sample (X, Y) is spam:
Figure BDA0002440538330000023
the probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities;
step 2.5: comparing the predicted result with the actual result of the sample;
if weak classifier $h_t$ correctly classifies the sample, i.e., $h_t(X) = \mathrm{sign}(Y)$: calculating $\lambda_t^{sc} \leftarrow \lambda_t^{sc} + \lambda$ to update the correct-classification weighted counter, where $\lambda$ is the sample weight; calculating
Figure BDA0002440538330000024
to update the approximate weighted misclassification rate $\varepsilon_t$; and calculating
Figure BDA0002440538330000025
to update the weight of the sample (X, Y);
if weak classifier $h_t$ misclassifies the sample: calculating $\lambda_t^{sw} \leftarrow \lambda_t^{sw} + \lambda$ to update the misclassification weighted counter; calculating, in the same way, the formula
Figure BDA0002440538330000026
to update the approximate weighted misclassification rate; and calculating
Figure BDA0002440538330000027
to update the weight of the sample (X, Y);
step 2.6: calculating the weight $\alpha_t$ of weak classifier $h_t$ to complete the update of weak classifier $h_t$:
Figure BDA0002440538330000031
step 2.7: inputting the updated sample into the next weak classifier and repeating steps 2.2 to 2.6 until all weak classifiers have been updated, which completes one cycle, and selecting the weak classifier with the highest weight;
step 2.8: determining whether all mail samples in the training set have been used for training; if not, returning to step 2.1; if all mail samples in the training set have been used, combining all selected weak classifiers into the strong classifier H(X):
Figure BDA0002440538330000032
and step 3: inputting the remaining mail samples to be detected into the strong classifier H(X) to complete the detection of junk mail.
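The formulas in steps 2.2 to 2.8 appear only as figure references in this text. As a hedged reading, assuming the method follows the standard multivariate Bernoulli naive Bayes model and the usual online AdaBoost updates, they would take roughly the following form. The symbol $\theta_{ik}$ (the estimated probability that feature $i$ is present in class $C_k$) and the set $S$ of selected weak classifiers are introduced here for illustration only, and the exact forms of $\alpha_t$ and $H(X)$ in particular are assumptions, since several conventions exist:

```latex
\begin{align*}
P(X \mid Y = C_k) &= \prod_{i=1}^{m} \theta_{ik}^{\,t_i}\,(1-\theta_{ik})^{\,1-t_i},
  \qquad \theta_{ik} = P(t_i = 1 \mid Y = C_k) \\
P(Y = C_k) &= \frac{n(C_k)}{\sum_{i} n(C_i)} \\
P(Y = C_k \mid X) &= \frac{P(X \mid Y = C_k)\,P(Y = C_k)}{\sum_{j} P(X \mid Y = C_j)\,P(Y = C_j)} \\
\varepsilon_t &= \frac{\lambda_t^{sw}}{\lambda_t^{sc} + \lambda_t^{sw}} \\
\lambda &\leftarrow \frac{\lambda}{2\,(1-\varepsilon_t)} \quad \text{(sample classified correctly)} \\
\lambda &\leftarrow \frac{\lambda}{2\,\varepsilon_t} \quad \text{(sample misclassified)} \\
\alpha_t &= \ln\frac{1-\varepsilon_t}{\varepsilon_t} \\
H(X) &= \mathrm{sign}\Big(\sum_{t \in S} \alpha_t\, h_t(X)\Big)
\end{align*}
```

Under this reading, a correctly classified sample has its weight reduced before it reaches the next weak classifier, while a misclassified sample has its weight increased, so later weak classifiers concentrate on the samples that earlier ones got wrong.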
The invention has the beneficial effects that:
The invention provides a spam detection method based on online AdaBoost, which applies the idea of online learning to AdaBoost to train a strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are expensive to train. To address these problems, the invention introduces the idea of online learning on the basis of AdaBoost, which improves the classification effect, greatly reduces the overhead of training the model, and allows the model to adapt to big data scenarios and dynamically changing environments in spam detection, thereby obtaining better generalization performance.
Drawings
Fig. 1 is a diagram of the steps of online AdaBoost training and picking weak classifiers.
Fig. 2 is a process diagram of combining strong classifiers.
Fig. 3 is a flow chart of the method implementation and application of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a spam detection method based on online AdaBoost, which applies the idea of online learning to AdaBoost to train a strong classifier. Traditional spam classifiers suffer from unstable classification performance, cannot be applied in a dynamic environment, and are expensive to train. To address these problems, the method introduces the idea of online learning on the basis of AdaBoost, which improves the classification effect, greatly reduces the overhead of training the model, and allows the model to adapt to big data scenarios and dynamically changing environments in spam detection, thereby obtaining better generalization performance.
A junk mail detection method based on online AdaBoost comprises the following steps:
OAdaboostNBC is constructed with a naive Bayes classifier as the base classifier and incorporates the idea of online learning. A plurality of weak classifiers are trained using the online AdaBoost and naive Bayes algorithms, and the best-performing weak classifiers among them are selected and combined into a strong classifier that classifies mail samples.
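The selection and combination of weak classifiers can be illustrated with a minimal sketch. The function names `select_best` and `strong_classify`, the +1/-1 label encoding, and the idea of keeping the `n` highest-weighted classifiers are illustrative assumptions, not details taken from the source:

```python
def select_best(classifiers, alphas, n):
    """Return the n (alpha, classifier) pairs with the largest weights.

    `classifiers` are callables mapping a feature vector to +1 or -1;
    `alphas` are their weights. Both names are hypothetical.
    """
    ranked = sorted(zip(alphas, classifiers), key=lambda pair: pair[0], reverse=True)
    return ranked[:n]


def strong_classify(selected, x):
    """Weighted vote of the selected weak classifiers.

    Assumed encoding: +1 means spam, -1 means normal mail.
    """
    score = sum(alpha * h(x) for alpha, h in selected)
    return 1 if score >= 0 else -1
```

In this sketch a weak classifier with a larger weight has more influence on the vote, so a single reliable classifier can outvote several weaker ones.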
The steps of training a strong classifier by using an online AdaBoost and naive Bayes algorithm are as follows:
(1) Train a plurality of weak classifiers using part of the mail sample data set, and initialize the weighted counters of correct and incorrect results for each weak classifier.
(2) For each newly input sample, train the weak classifier with a multivariate Bernoulli naive Bayes model, update the weak classifier's weighted counters of correct and incorrect classifications according to its classification result for the new sample, and calculate and update the weight of the current sample and the weight of the weak classifier. Then input the sample with its updated weight into the other weak classifiers in sequence until all weak classifiers have been updated, and select the weak classifier with the highest weight among them, which completes one cycle.
(3) Combine the weak classifiers selected in each cycle into a strong classifier.
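The weak learner in step (2) can be sketched as a Bernoulli naive Bayes model that keeps running counts and is updated one sample at a time. This is a minimal sketch under stated assumptions: the class name `OnlineBernoulliNB`, the +1 (spam) / -1 (normal mail) label encoding, and the Laplace smoothing are illustrative choices that do not come from the source.

```python
import math


class OnlineBernoulliNB:
    """Multivariate Bernoulli naive Bayes that learns one sample at a time.

    Assumptions (not from the source): labels are +1 (spam) or -1 (normal),
    features are 0/1 lists of fixed length, and Laplace smoothing is applied.
    """

    def __init__(self, n_features, smoothing=1.0):
        self.n_features = n_features
        self.smoothing = smoothing
        # Number of samples seen per class.
        self.class_count = {1: 0.0, -1: 0.0}
        # Per class, how often each feature was present.
        self.feature_count = {1: [0.0] * n_features, -1: [0.0] * n_features}

    def learn(self, x, y, times=1):
        """Update the counts as if sample (x, y) had been seen `times` times."""
        self.class_count[y] += times
        counts = self.feature_count[y]
        for i, t_i in enumerate(x):
            counts[i] += times * t_i

    def log_joint(self, x, c):
        """Return log P(Y = c) + log P(X | Y = c) using smoothed estimates."""
        total = sum(self.class_count.values())
        prior = (self.class_count[c] + self.smoothing) / (total + 2 * self.smoothing)
        score = math.log(prior)
        for i, t_i in enumerate(x):
            theta = (self.feature_count[c][i] + self.smoothing) / (
                self.class_count[c] + 2 * self.smoothing
            )
            # A present feature contributes theta, an absent one 1 - theta.
            score += math.log(theta if t_i else 1.0 - theta)
        return score

    def predict(self, x):
        """Predict the class with the larger joint probability."""
        return 1 if self.log_joint(x, 1) > self.log_joint(x, -1) else -1
```

Because both classes share the same evidence term P(X), comparing the joint probabilities gives the same decision as comparing the posteriors. The `times` argument lets a boosting loop present the same sample k times in a single call.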
The invention provides an efficient spam detection method based on online AdaBoost that improves spam detection accuracy and model training efficiency and adapts to a dynamic environment. The method comprises the following 5 steps:
1. Train a plurality of weak classifiers using part of the mail sample data set, and initialize the weight counters $\lambda_t^{sc}$ and $\lambda_t^{sw}$ ($t = 1, 2, \ldots, D$) of the weak classifiers, where $\lambda_t^{sc}$ and $\lambda_t^{sw}$ are the weighted counters of correctly classified samples and misclassified samples, respectively.
2. Define $h_t$ as the weak classifier into which a new sample (X, Y) is first input, where X is the feature set of the mail sample (X, Y) and Y is its label, which marks whether the mail sample is spam. Initialize the weight of the sample (X, Y) to $\lambda = 1$, randomly draw a positive integer k from the Poisson distribution $\mathrm{Poisson}(\lambda)$, and let the weak classifier $h_t$ learn the sample (X, Y) k times using a multivariate Bernoulli naive Bayes model. The specific learning process is as follows:
(1) Let $X = (t_1, \ldots, t_m)$, where each $t_i$ is a binary variable that indicates whether the feature is present in the sample (0 indicates that feature $t_i$ is not present in the sample, 1 indicates that it is present), and m is the number of features of the sample (X, Y).
Calculate formula (6):
Figure BDA0002440538330000041
The resulting $P(X \mid Y = C_k)$ is the intermediate conditional probability used in computing the spam probability, where $C_k$ represents the category of mail (normal mail or spam). Calculate formula (7):
Figure BDA0002440538330000042
The resulting $P(Y = C_k)$ is the probability that $C_k$ occurs within the training set, and $n(C_i)$ represents the frequency with which samples of category $C_i$ occur in the training set. Calculate the Bayes formula (8):
Figure BDA0002440538330000051
The resulting $P(Y = C_k \mid X)$ is the probability that the mail sample (X, Y) is spam. The probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities.
3. Compare the predicted result with the actual result. If weak classifier $h_t$ correctly classifies the sample, i.e., $h_t(X) = \mathrm{sign}(Y)$, calculate formula (1):
$\lambda_t^{sc} \leftarrow \lambda_t^{sc} + \lambda$ (1)
to update the correct-classification weighted counter, where $\lambda$ is the sample weight. With $\varepsilon_t$ denoting the weighted misclassification rate, calculate formula (2):
Figure BDA0002440538330000052
to update the approximate weighted misclassification rate, and calculate formula (3):
Figure BDA0002440538330000053
to update the weight of the sample (X, Y).
If weak classifier $h_t$ misclassifies the sample, calculate $\lambda_t^{sw} \leftarrow \lambda_t^{sw} + \lambda$ to update the misclassification weighted counter, and calculate formula (2) in the same way to update the approximate weighted misclassification rate. Then calculate the formula
Figure BDA0002440538330000054
to update the weight of the sample (X, Y).
4. Calculate formula (4):
Figure BDA0002440538330000055
to complete the update of weak classifier $h_t$, where $\alpha_t$ is the weight of weak classifier $h_t$.
5. Input the updated sample into the next weak classifier and continue performing the sample and weak classifier update steps until all weak classifiers have been updated, which completes one cycle. One cycle is performed for each newly input sample, and the weak classifier with the highest weight is selected in each cycle. After the cycles end, that is, after all mail samples have been used for training, all selected weak classifiers are combined into the strong classifier, as in formula (5):
Figure BDA0002440538330000061
where H is the final strong classifier after training.
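The per-sample cycle in steps 2 to 5 can be sketched as follows. This is a minimal sketch, not the patented implementation: it assumes the usual online AdaBoost updates (misclassification rate as the ratio of the wrong counter to both counters, sample weight divided by 2(1 - rate) or 2 * rate, classifier weight as the log-odds of the rate). `StumpLearner` is a hypothetical stand-in for the naive Bayes weak classifier, `OnlineAdaBoost` and `poisson` are illustrative names, and labels are assumed to be +1 (spam) or -1 (normal mail). A Poisson draw can also return 0, in which case the weak classifier simply skips that sample.

```python
import math
import random


def poisson(lam, rng):
    """Draw k ~ Poisson(lam) with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


class StumpLearner:
    """Hypothetical stand-in weak learner that votes using one binary feature."""

    def __init__(self, feature):
        self.feature = feature
        # tallies[v][y]: how often label y was learned with feature value v.
        self.tallies = {0: {1: 0, -1: 0}, 1: {1: 0, -1: 0}}

    def learn(self, x, y):
        self.tallies[x[self.feature]][y] += 1

    def predict(self, x):
        t = self.tallies[x[self.feature]]
        return 1 if t[1] >= t[-1] else -1


class OnlineAdaBoost:
    def __init__(self, learners, seed=0):
        self.learners = learners
        self.sc = [0.0] * len(learners)  # weighted counters of correct results
        self.sw = [0.0] * len(learners)  # weighted counters of wrong results
        self.rng = random.Random(seed)

    def error(self, t):
        """Approximate weighted misclassification rate of learner t."""
        total = self.sc[t] + self.sw[t]
        return self.sw[t] / total if total > 0 else 0.5

    def alpha(self, t):
        """Weight of learner t (assumed log-odds form, clipped to stay finite)."""
        eps = min(max(self.error(t), 1e-10), 1 - 1e-10)
        return math.log((1 - eps) / eps)

    def update(self, x, y):
        """Run one cycle for sample (x, y); return the highest-weighted index."""
        lam = 1.0
        for t, h in enumerate(self.learners):
            for _ in range(poisson(lam, self.rng)):
                h.learn(x, y)
            if h.predict(x) == y:
                self.sc[t] += lam
                lam /= 2.0 * (1.0 - self.error(t))
            else:
                self.sw[t] += lam
                lam /= 2.0 * self.error(t)
        return max(range(len(self.learners)), key=self.alpha)

    def predict(self, x):
        """Weighted vote over all learners (a simplification of the selection step)."""
        score = sum(self.alpha(t) * h.predict(x) for t, h in enumerate(self.learners))
        return 1 if score >= 0 else -1
```

For brevity the sketch votes over all weak classifiers rather than only the ones selected in each cycle; restricting the vote to the indices returned by `update` would follow the text more closely.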
After the above 5 steps, a spam detection method based on online AdaBoost is formed. The method enhances the classification stability of the junk mails, reduces the training overhead, and can better adapt to the dynamically changing environment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A spam detection method based on online AdaBoost, characterized by comprising the following steps:
step 1: inputting the mail samples to be detected; taking part of the mail sample data to construct a training set, wherein for each mail sample (X, Y), X is the feature set of the mail sample and Y is the label of the mail sample, which marks whether the mail sample is junk mail; Y is manually labeled in the training set;
step 2: training D weak classifiers using the training set, and initializing the weight counters $\lambda_t^{sc}$ and $\lambda_t^{sw}$ of the weak classifiers, $t = 1, 2, \ldots, D$, where $\lambda_t^{sc}$ and $\lambda_t^{sw}$ are the weighted counters of correctly classified samples and misclassified samples, respectively; the specific process is as follows:
step 2.1: drawing a sample (X, Y) from the training set and inputting it into weak classifier $h_t$; initializing the weight of the sample (X, Y) to $\lambda = 1$; randomly drawing a positive integer k from the Poisson distribution $\mathrm{Poisson}(\lambda)$, the weak classifier $h_t$ learning the sample k times using a multivariate Bernoulli naive Bayes model;
step 2.2: letting $X = (t_1, \ldots, t_m)$, wherein each $t_i$ is a binary variable that indicates whether the feature is present in the sample and m is the number of features of the sample (X, Y), and calculating the intermediate conditional probability $P(X \mid Y = C_k)$:
Figure FDA0002440538320000011
wherein $C_k$ represents a category of mail, i.e., normal mail or spam;
step 2.3: calculating the probability $P(Y = C_k)$ that $C_k$ occurs within the training set:
Figure FDA0002440538320000012
wherein $n(C_i)$ represents the frequency with which samples of category $C_i$ occur in the training set;
step 2.4: calculating the probability $P(Y = C_k \mid X)$ that the mail sample (X, Y) is spam:
Figure FDA0002440538320000013
the probability that the sample is normal mail is obtained in the same way, and the category of the sample (X, Y) is predicted by comparing the two probabilities;
step 2.5: comparing the predicted result with the actual result of the sample;
if weak classifier $h_t$ correctly classifies the sample, i.e., $h_t(X) = \mathrm{sign}(Y)$: calculating $\lambda_t^{sc} \leftarrow \lambda_t^{sc} + \lambda$ to update the correct-classification weighted counter, where $\lambda$ is the sample weight; calculating
Figure FDA0002440538320000021
to update the approximate weighted misclassification rate $\varepsilon_t$; and calculating
Figure FDA0002440538320000022
to update the weight of the sample (X, Y);
if weak classifier $h_t$ misclassifies the sample: calculating $\lambda_t^{sw} \leftarrow \lambda_t^{sw} + \lambda$ to update the misclassification weighted counter; calculating, in the same way, the formula
Figure FDA0002440538320000023
to update the approximate weighted misclassification rate; and calculating
Figure FDA0002440538320000024
to update the weight of the sample (X, Y);
step 2.6: calculating the weight $\alpha_t$ of weak classifier $h_t$ to complete the update of weak classifier $h_t$:
Figure FDA0002440538320000025
step 2.7: inputting the updated sample into the next weak classifier and repeating steps 2.2 to 2.6 until all weak classifiers have been updated, which completes one cycle, and selecting the weak classifier with the highest weight;
step 2.8: determining whether all mail samples in the training set have been used for training; if not, returning to step 2.1; if all mail samples in the training set have been used, combining all selected weak classifiers into the strong classifier H(X):
Figure FDA0002440538320000026
and step 3: inputting the remaining mail samples to be detected into the strong classifier H(X) to complete the detection of junk mail.
CN202010265704.7A 2020-04-07 2020-04-07 Junk mail detection method based on online AdaBoost Pending CN111553388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010265704.7A CN111553388A (en) 2020-04-07 2020-04-07 Junk mail detection method based on online AdaBoost

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010265704.7A CN111553388A (en) 2020-04-07 2020-04-07 Junk mail detection method based on online AdaBoost

Publications (1)

Publication Number Publication Date
CN111553388A (en) 2020-08-18

Family

ID=72004269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010265704.7A Pending CN111553388A (en) 2020-04-07 2020-04-07 Junk mail detection method based on online AdaBoost

Country Status (1)

Country Link
CN (1) CN111553388A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221623A (en) * 2008-01-30 2008-07-16 北京中星微电子有限公司 Object type on-line training and recognizing method and system thereof
CN103593672A (en) * 2013-05-27 2014-02-19 深圳市智美达科技有限公司 Adaboost classifier on-line learning method and Adaboost classifier on-line learning system
CN107133597A (en) * 2017-05-11 2017-09-05 南宁市正祥科技有限公司 A kind of front vehicles detection method in the daytime
CN108537279A (en) * 2018-04-11 2018-09-14 中南大学 Based on the data source grader construction method for improving Adaboost algorithm
US20180293381A1 (en) * 2017-04-07 2018-10-11 Trustpath Inc. System and method for malware detection on a per packet basis
CN108804651A (en) * 2018-06-07 2018-11-13 南京邮电大学 A kind of Social behaviors detection method based on reinforcing Bayes's classification
CN110149268A (en) * 2019-05-15 2019-08-20 深圳市趣创科技有限公司 A kind of method and its system of automatic fitration spam
CN110737805A (en) * 2019-10-18 2020-01-31 网易(杭州)网络有限公司 Method and device for processing graph model data and terminal equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221623A (en) * 2008-01-30 2008-07-16 北京中星微电子有限公司 Object type on-line training and recognizing method and system thereof
CN103593672A (en) * 2013-05-27 2014-02-19 深圳市智美达科技有限公司 Adaboost classifier on-line learning method and Adaboost classifier on-line learning system
US20180293381A1 (en) * 2017-04-07 2018-10-11 Trustpath Inc. System and method for malware detection on a per packet basis
CN107133597A (en) * 2017-05-11 2017-09-05 南宁市正祥科技有限公司 A kind of front vehicles detection method in the daytime
CN108537279A (en) * 2018-04-11 2018-09-14 中南大学 Based on the data source grader construction method for improving Adaboost algorithm
CN108804651A (en) * 2018-06-07 2018-11-13 南京邮电大学 A kind of Social behaviors detection method based on reinforcing Bayes's classification
CN110149268A (en) * 2019-05-15 2019-08-20 深圳市趣创科技有限公司 A kind of method and its system of automatic fitration spam
CN110737805A (en) * 2019-10-18 2020-01-31 网易(杭州)网络有限公司 Method and device for processing graph model data and terminal equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
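WEIMING HU et al.: "Online Adaboost-Based Parameterized Methods for Dynamic Distributed Network Intrusion Detection", IEEE TRANSACTIONS ON CYBERNETICS *
李茹: "Research and Implementation of a Multi-level Mail Filtering System Based on Minimum-Risk Bayes", China Master's Theses Full-text Database, Information Science and Technology *
翟军昌: "Personalized Spam Filtering Based on the Naive Bayes Algorithm", Journal of Changchun Normal University (Natural Science Edition) *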

Similar Documents

Publication Publication Date Title
CN108256561B (en) Multi-source domain adaptive migration method and system based on counterstudy
CN108564129B (en) Trajectory data classification method based on generation countermeasure network
CN109190442B (en) Rapid face detection method based on deep cascade convolution neural network
CN101937513B (en) Information processing apparatus and information processing method
CN104699772B (en) A kind of big data file classification method based on cloud computing
CN107392919B (en) Adaptive genetic algorithm-based gray threshold acquisition method and image segmentation method
CN108364016A (en) Gradual semisupervised classification method based on multi-categorizer
CN110378366B (en) Cross-domain image classification method based on coupling knowledge migration
JP7266674B2 (en) Image classification model training method, image processing method and apparatus
CN103559504A (en) Image target category identification method and device
CN105022754A (en) Social network based object classification method and apparatus
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
CN109919055B (en) Dynamic human face emotion recognition method based on AdaBoost-KNN
Asadi et al. Creating discriminative models for time series classification and clustering by HMM ensembles
CN103279746A (en) Method and system for identifying faces based on support vector machine
CN116503676B (en) Picture classification method and system based on knowledge distillation small sample increment learning
Solanki et al. Spam filtering using hybrid local-global Naive Bayes classifier
CN108198324B (en) A kind of multinational bank note currency type recognition methods based on finger image
CN107220663A (en) A kind of image automatic annotation method classified based on semantic scene
US20060179017A1 (en) Preparing data for machine learning
CN116644339B (en) Information classification method and system
CN112163069B (en) Text classification method based on graph neural network node characteristic propagation optimization
CN104281569A (en) Building device and method, classifying device and method and electronic device
CN111553388A (en) Junk mail detection method based on online AdaBoost
CN115828100A (en) Mobile phone radiation source spectrogram category increment learning method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200818

RJ01 Rejection of invention patent application after publication