CN102035753A - Filter dynamic integration-based method for filtering junk mail - Google Patents

Info

Publication number
CN102035753A
Authority
CN
China
Prior art keywords
filter
mail
text
classification
filters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102056212A
Other languages
Chinese (zh)
Other versions
CN102035753B (en)
Inventor
王金龙
高珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Technology
Original Assignee
Qingdao University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Technology
Priority to CN2009102056212A
Publication of CN102035753A
Application granted
Publication of CN102035753B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to spam processing in the field of electronic mail, and in particular to a spam filtering method based on dynamic filter integration. The method comprises the following steps: processing spam with a text processing method; having the user group the filters and select the initial filters; and dynamically selecting filters through delayed replacement control. Because the user groups the filters according to a classification principle, and a filter is dynamically chosen from each group for integrated classification, the filters complement one another when working together. This overcomes the inability of conventional multi-filter integration methods to cope with changing mail characteristics, and it improves the accuracy and stability of spam filtering.

Description

A spam filtering method based on dynamic filter integration
Technical field
The present invention relates to spam processing in the field of electronic mail, and in particular to a spam filtering method based on dynamic filter integration.
Background art
The development of information and communication technology (ICT), and the growth in information it has brought, has greatly promoted communication and exchange between people. As a by-product of this information explosion, spam consumes large amounts of transmission, storage, and computing resources, causing enormous waste, and it does considerable harm in other respects as well.
At present, anti-spam techniques mainly comprise protocol-based methods, rule-based methods, and methods based on statistical machine learning. As spam changes ever more rapidly online, protocol-based and rule-based methods, which depend on predefined rules, often cannot handle new forms of spam in time. Methods based on statistical learning, by contrast, can adapt over time and have become a focus of recent research. In particular, with improvements in feature selection techniques and machine learning algorithms, statistical spam filtering methods have achieved excellent performance.
In recent years, as spam has grown more varied, a single classifier learning algorithm often cannot adapt to the changes, so combining several algorithms to improve classification performance has attracted wide attention. Existing filter combination methods, however, simply select filters that perform well individually, without distinguishing or categorizing them. As a result, filters with similar mechanisms often fail to complement one another when combined, and filtering stability is poor. Moreover, once an existing combination has been chosen, it is never adjusted. Over time, spammers can easily evade a fixed integration scheme by continually changing mail features, so that the scheme loses its ability to discriminate new generations of spam and filtering accuracy declines.
Summary of the invention
In view of this, the invention provides a spam filtering method based on dynamic filter integration. It groups the filters and dynamically configures the integrated filter set, overcoming the defects of the prior art and improving the accuracy and stability of spam filtering.
To achieve the above object, the technical scheme of the invention is as follows:
A. Spam is processed with a text processing method;
B. The user groups the filters and selects the initial filters;
C. Filters are dynamically selected through delayed replacement control.
As the above technical scheme shows, in the spam filtering method based on dynamic filter integration of the present invention, the user groups the filters according to a classification principle, and a filter is dynamically chosen from each group for integrated classification. The filters therefore complement one another fully when working together, which overcomes the weakness of existing multi-filter integration methods in coping with changing mail features and improves the accuracy and stability of spam filtering.
Description of drawings
Fig. 1 is a schematic diagram of a spam filtering method based on dynamic filter integration in an embodiment of the invention.
Fig. 2 is a flow chart of a spam filtering method based on dynamic filter integration in an embodiment of the invention.
Fig. 3 is a flow chart of a spam filtering method based on dynamic integration of heterogeneous filters in an embodiment of the invention.
Embodiment
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments; embodiments of the invention are not limited to these.
Fig. 1 is a schematic diagram of a spam filtering method based on dynamic filter integration in an example of the invention. As shown in Fig. 1, the invention first processes mail with a text processing method to obtain a processing result. The result is then input to each filter selected from the user-defined groups for learning and classification. Finally, filters are dynamically selected according to the classification results and user feedback.
Fig. 2 is a flow chart of a spam filtering method based on dynamic filter integration in an example of the invention. As shown in Fig. 2, the method comprises the following steps:
Step 201: spam is processed with a text processing method.
The text processing method comprises extraction of the mail text, word segmentation, feature selection, and text vector mapping. The specific processing steps are as follows:
1) Extraction of the mail text
Raw mail is generally encoded and may use a variety of character encodings. Extracting the mail text therefore requires the following steps: decode the mail to obtain its content; determine the character encoding of the text in the mail and convert it to a single unified encoding; finally, extract the text in the unified encoding.
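The extraction step above can be sketched with Python's standard `email` package. This is only an illustration under stated assumptions, not the implementation used in the patent: the function name `extract_text` is hypothetical, only `text/plain` parts are considered, the unified encoding is taken to be Unicode (a Python `str`), and UTF-8 is assumed when a part declares no character set.

```python
import base64
from email import message_from_bytes


def extract_text(raw: bytes) -> str:
    """Decode a raw message and return its plain-text content as Unicode."""
    msg = message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes the transfer encoding (base64, quoted-printable).
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


# A GBK-encoded, base64-transferred message built at runtime for the demo.
body = "免费优惠".encode("gbk")
raw = (
    b"Subject: demo\r\n"
    b"Content-Type: text/plain; charset=gbk\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + base64.b64encode(body) + b"\r\n"
)
text = extract_text(raw)
```

Converting every part to a Python `str` is one way to unify the character encoding; a filter that expects bytes could re-encode the result to a single encoding such as UTF-8.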
2) Word segmentation
Languages such as Chinese are written continuously, without spaces between words. For some machine learning algorithms to make use of such text, it must first be segmented into words so that features representing the text can be found.
3) Feature selection
Feature selection maps high-dimensional data to a lower-dimensional representation, which reduces data sparsity and can also remove some noise, improving the performance of the classification algorithm. It is therefore an important data preprocessing step. Commonly used feature selection methods include document frequency (DF) and information gain (IG).
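As a small illustration of document-frequency feature selection, the sketch below keeps the `k` terms that occur in the most training documents. The function name, the alphabetical tie-breaking rule, and the toy documents are assumptions made for the example; the patent does not specify them.

```python
from collections import Counter


def select_by_document_frequency(docs, k):
    """Return the k terms that appear in the largest number of documents.

    docs is a list of token lists (already word-segmented).
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    # Sort by descending DF, then alphabetically so the result is deterministic.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    ["free", "offer", "click"],
    ["meeting", "offer"],
    ["free", "click", "click"],
]
features = select_by_document_frequency(docs, 2)
```

Here "click", "free", and "offer" each occur in two documents and "meeting" in one, so with `k = 2` the tie is broken alphabetically.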
4) Vector mapping
Some text classification algorithms compute similarity using the vector space model and therefore require vector input. Text vector mapping converts the text representation of a mail into a vector representation. The length of the vector is the number of feature words that occur in the training mail set, and the value of each dimension is the weight of the corresponding feature word in the text. The training mail set is the set of labeled mail used to train the filters. Commonly used methods for computing feature word weights include binary weighting, term frequency (TF), and inverse document frequency (IDF).
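The three weighting schemes named above can be sketched as follows. The helper names are hypothetical, and the IDF variant shown (weight log(N/df) for a word that is present, 0 otherwise) is one common reading, assumed here because the text does not give a formula.

```python
import math


def idf_weights(docs, features):
    """Inverse document frequency log(N / df) for each feature word."""
    n = len(docs)
    weights = {}
    for f in features:
        df = sum(1 for tokens in docs if f in tokens)
        weights[f] = math.log(n / df) if df else 0.0
    return weights


def to_vector(tokens, features, scheme="binary", idf=None):
    """Map a segmented mail (list of tokens) to a vector over the features."""
    counts = [tokens.count(f) for f in features]
    if scheme == "binary":
        return [1.0 if c else 0.0 for c in counts]
    if scheme == "tf":
        return [float(c) for c in counts]
    if scheme == "idf":
        return [idf[f] if c else 0.0 for f, c in zip(features, counts)]
    raise ValueError(f"unknown scheme: {scheme}")


train = [["free", "offer", "click"], ["meeting", "offer"], ["free", "click", "click"]]
features = ["click", "free", "meeting"]
mail = ["free", "click", "click", "hello"]

binary_vec = to_vector(mail, features, "binary")
tf_vec = to_vector(mail, features, "tf")
idf_vec = to_vector(mail, features, "idf", idf_weights(train, features))
```

The word "hello" is not a feature word, so it contributes to none of the dimensions.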
After the text set has been preprocessed as above, each classification filter is given the form of input that it requires.
Step 202: the user groups the filters and selects the initial filters.
Grouping the filters means that the user can group them according to their underlying mechanism. At the start, one filter is selected at random from each group as an initial classifier.
Step 203: filters are dynamically selected through delayed replacement control.
In this step, the embodiment first uses the selected filters to perform integrated classification on the input provided by the text processing method. Then, according to the classification results and user feedback, it dynamically selects filters through delayed replacement control. The specific processing steps are as follows:
1) Integrated classification is performed with the selected filters on the input provided by the text processing method.
The steps of integrated classification are as follows. First, each filter obtains its own classification model through training. Next, each filter uses its model to score the mail to be classified. Then, all the individual scores are collected and integrated into a final score. Finally, a threshold policy delivers the mail to the normal mail folder or the spam folder.
Each filter obtains its classification model through training in one of two situations. First, before a filter is used to classify for the first time, it must learn from a number of labeled mails: the filter input is obtained by the text processing method and, together with the mail labels, is used to train the initial classification filter. Second, once a filter is in use, its classification model is trained on the mail labels supplied by user feedback together with the corresponding input provided by the text processing method. Integration is either linear or non-linear. Linear integration includes simple arithmetic averaging and weighted averaging with weights set according to historical accuracy; non-linear integration includes integration based on support vector machines (SVM).
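The two linear integration modes can be illustrated as follows. This is a sketch only: the filter names "A", "B", "C", the scores, and the choice to use each filter's historical accuracy directly as its weight are assumptions, since the text says only that the weights are set according to historical accuracy.

```python
def arithmetic_mean(scores):
    """Simple arithmetic-mean integration of per-filter spam scores."""
    return sum(scores.values()) / len(scores)


def accuracy_weighted_mean(scores, accuracy):
    """Weighted-average integration; weights are historical accuracies."""
    total = sum(accuracy[name] for name in scores)
    if total == 0:
        # Degenerate case: no filter has any credit, fall back to the plain mean.
        return arithmetic_mean(scores)
    return sum(accuracy[name] * s for name, s in scores.items()) / total


scores = {"A": 0.9, "B": 0.6, "C": 0.3}      # spam scores in [0, 1]
accuracy = {"A": 0.9, "B": 0.6, "C": 0.5}    # hypothetical historical accuracies

plain = arithmetic_mean(scores)
weighted = accuracy_weighted_mean(scores, accuracy)
```

With these numbers the plain mean is about 0.60 and the weighted mean about 0.66, because the most accurate filter also gave the highest spam score.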
2) Filters are dynamically selected through delayed replacement control according to the classification results and user feedback.
The user labels some of the mail while reading it. On this basis, the next selection of filters is decided through delayed replacement control as follows. First, the user's labels are recorded and used to compute the accuracy of each running filter over a time range T1. Next, if the accuracy of one or more filters over T1 is below a user-specified threshold E, those filters become candidates for replacement. Then, the accuracy of each candidate is observed over the following time range T2; if it is still below E, the filter is replaced by another filter selected at random from its group.
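Computing a filter's accuracy over a time range from recorded user labels might look like the sketch below. The `Feedback` record and the definition of accuracy as the fraction of user-labeled mails that the filter classified correctly are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    timestamp: float       # seconds since some reference point
    predicted_spam: bool   # what the filter decided
    labeled_spam: bool     # what the user marked


def windowed_accuracy(records, start, end):
    """Fraction of labeled mails in [start, end) that the filter got right.

    Returns None when the user labeled nothing in the window.
    """
    window = [r for r in records if start <= r.timestamp < end]
    if not window:
        return None
    correct = sum(1 for r in window if r.predicted_spam == r.labeled_spam)
    return correct / len(window)


records = [
    Feedback(0.0, True, True),      # correct
    Feedback(10.0, True, False),    # wrong
    Feedback(100.0, False, False),  # correct, but outside the first window
]
acc = windowed_accuracy(records, 0.0, 50.0)
```

Returning `None` for an empty window leaves it to the caller to decide whether a filter with no feedback should be treated as accurate or not.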
Fig. 3 is a flow chart of a spam filtering method based on dynamic integration of heterogeneous filters in an example of the invention. "Heterogeneous" means that the filters work on different principles, that is, the core modules of the filters are based on different machine learning techniques. In this example, the filters initially provided are: two filters based on Bayesian classification, SpamProbe and BogoFilter; PPM (Prediction by Partial Matching), a filter based on partial-match prediction; DMC (Dynamic Markov Compression), a filter based on dynamic Markov compression; ROSVM (Relaxed Online SVM), an improvement on the traditional SVM; and LR_trirls (Logistic Regression with truncated iteratively re-weighted least squares), a filter based on logistic regression.
As shown in Fig. 3, the specific implementation steps of the spam filtering method based on dynamic integration of heterogeneous filters in this example are as follows:
1) Mail is processed with the text processing method.
The text processing method yields three representations of each mail: the extracted mail text as is, the extracted mail text after word segmentation, and the extracted mail text as a vector.
Extracting the text involves decoding, removing tag information, character set conversion, conversion between traditional and simplified Chinese characters, and extraction of the subject and body.
Word segmentation operates on the mail text obtained above: the Tianwang ("day net") word segmentation program segments the extracted text, and the segmented text is saved.
Vector representation follows word segmentation. Features are selected with the document frequency (DF) method, retaining 1000 dimensions. Based on the selected features, the segmented mail text is mapped to a vector with a binary weight in each dimension: the weight of a feature word is 1 if it appears in the mail text and 0 otherwise.
2) The user groups the filters and selects the initial filters.
The user places filters that share a working principle in the same group: the discriminative filters ROSVM and LR_trirls form one group; the generative filters SpamProbe and BogoFilter form one group; and the compression-based filters PPM and DMC form one group. PPM, BogoFilter, and ROSVM are selected at random as the initial integrated filters.
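The grouping and random initial selection can be written down directly. The dictionary layout and the function name `choose_initial` are illustrative assumptions; the filter names and groups are those given in the text, and the strings stand in for real filter objects.

```python
import random

# Groups from the example: filters sharing a working principle go together.
GROUPS = {
    "discriminative": ["ROSVM", "LR_trirls"],
    "generative": ["SpamProbe", "BogoFilter"],
    "compression": ["PPM", "DMC"],
}


def choose_initial(groups, rng):
    """Pick one filter at random from every group."""
    return {name: rng.choice(members) for name, members in groups.items()}


active = choose_initial(GROUPS, random.Random())
```

One possible outcome is the selection used in the example: ROSVM, BogoFilter, and PPM. Passing a seeded `random.Random` makes the selection reproducible in tests.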
3) Filters are dynamically selected through delayed replacement control.
This comprises two steps:
(a) Integrated classification is performed with the selected filters on the input provided by the text processing method.
First, the extracted mail text is input to PPM, the word-segmented mail text is input to BogoFilter, and the vector representation is input to ROSVM; each of the three filters obtains its own classification model through training. Next, the three filters use their models to judge the mail to be classified, each outputting a spam probability in the interval [0, 1]. Then, simple arithmetic-mean integration is applied: the mean S of the scores output by all the filters is computed. Finally, S is compared with the preset threshold T = 0.5: if S exceeds T, the mail is judged to be spam; otherwise it is judged to be normal mail.
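The final decision rule is small enough to show in full. The function name and the example scores below are made up for illustration; the threshold T = 0.5 and the strict "exceeds" comparison come from the text.

```python
def classify(scores, threshold=0.5):
    """Average the per-filter spam probabilities and apply the threshold."""
    s = sum(scores) / len(scores)
    label = "spam" if s > threshold else "normal"
    return s, label


# Hypothetical outputs of PPM, BogoFilter, and ROSVM for two mails.
s1, label1 = classify([0.9, 0.7, 0.2])   # mean is about 0.6
s2, label2 = classify([0.4, 0.5, 0.3])   # mean is about 0.4
```

Note that a mean of exactly 0.5 is judged normal, because the rule requires S to exceed T.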
The classification models are obtained through training in one of two situations. First, before the three filters are used to classify for the first time, the initial classification models of PPM, BogoFilter, and ROSVM are trained on labeled mail prepared in advance. Second, once the three filters are in use, their classification models are trained on the mail labels supplied by user feedback together with the corresponding input provided by the text processing method.
(b) Filters are dynamically selected through delayed replacement control according to the classification results and user feedback.
First, the user checks received mail and labels the mail checked. Next, with the user's labels and the classification results as historical information, the historical accuracy A of PPM, BogoFilter, and ROSVM over the time range T1 is computed. Then, if the historical accuracy A of one or more filters over T1 is below the user-specified threshold E, those filters become candidates for replacement. Finally, if the historical accuracy A of a candidate is still below E over the following period T2, that filter is replaced by another filter selected at random from its group.
Where:
[Formula image B2009102056212D0000071]
T1 = 24 hours, T2 = 12 hours, E = 60%.
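The two-stage decision with these parameter values can be sketched as a single function. The function signature is an assumption made for illustration: the accuracies over T1 and over the following T2 are taken as already computed, and the candidate list is simply the other members of the filter's group.

```python
import random

T1_HOURS = 24   # first observation window
T2_HOURS = 12   # follow-up window for a filter marked for replacement
E = 0.60        # accuracy threshold given by the user


def delayed_replacement(acc_t1, acc_t2, current, group, rng, threshold=E):
    """Return the filter that should run next for one group.

    acc_t1: accuracy of the current filter over T1.
    acc_t2: its accuracy over the following T2 (only consulted if acc_t1 is low).
    """
    if acc_t1 >= threshold:
        return current                 # accurate enough, nothing to do
    if acc_t2 >= threshold:
        return current                 # recovered during the delay, keep it
    candidates = [f for f in group if f != current]
    # Replace with another filter chosen at random from the same group.
    return rng.choice(candidates) if candidates else current


rng = random.Random(0)
kept = delayed_replacement(0.55, 0.70, "PPM", ["PPM", "DMC"], rng)
swapped = delayed_replacement(0.55, 0.50, "PPM", ["PPM", "DMC"], rng)
```

The delay means that a filter whose accuracy dips below E during one 24-hour window is not discarded immediately; it is replaced only if the next 12 hours confirm the drop.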
The invention can be readily implemented as described above.

Claims (7)

1. A spam filtering method based on dynamic filter integration, characterized in that the method comprises the following steps:
A. processing spam with a text processing method;
B. the user grouping the filters and selecting the initial filters;
C. dynamically selecting filters through delayed replacement control.
2. The method according to claim 1, characterized in that the text processing method of step A comprises: extraction of the mail text, word segmentation, feature selection, and text vector mapping.
3. The method according to claim 1, characterized in that grouping the filters in step B means that the user can group them according to the mechanism of the filters, and selecting the initial filters means that, at the start, one filter is selected at random from each group as an initial classifier.
4. The method according to claim 1, characterized in that step C specifically comprises:
C1. performing integrated classification with the selected filters on the input provided by the text processing method;
C2. dynamically selecting filters through delayed replacement control according to the classification results and user feedback.
5. The method according to claim 4, characterized in that the integrated classification of step C1 specifically comprises:
C11. each filter obtaining its own classification model through training;
C12. using the classification models obtained to score the mail to be classified;
C13. collecting and integrating all the individual scores to obtain a final score;
C14. delivering the mail to the normal mail folder or the spam folder through a threshold policy.
6. The method according to claim 5, characterized in that step C11 covers two situations: first, before a filter is used to classify for the first time, it learns from a number of labeled mails, the filter input being obtained by the text processing method and, together with the mail labels, used to train the initial classification filter; second, once a filter is in use, its classification model is trained on the mail labels supplied by user feedback together with the corresponding input provided by the text processing method.
7. The method according to claim 4, characterized in that step C2 specifically comprises: first, recording the user's labels and, on that basis, computing the accuracy of each running filter over a time range T1; next, if the accuracy of one or more filters over T1 is below a user-specified threshold E, preparing to replace those filters; then, observing the accuracy of each filter to be replaced over the following time range T2 and, if it is still below E, replacing it with another filter selected at random from its group.
CN2009102056212A 2009-10-02 2009-10-02 Filter dynamic integration-based method for filtering junk mail Expired - Fee Related CN102035753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102056212A CN102035753B (en) 2009-10-02 2009-10-02 Filter dynamic integration-based method for filtering junk mail

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102056212A CN102035753B (en) 2009-10-02 2009-10-02 Filter dynamic integration-based method for filtering junk mail

Publications (2)

Publication Number Publication Date
CN102035753A true CN102035753A (en) 2011-04-27
CN102035753B CN102035753B (en) 2012-07-11

Family

ID=43888108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102056212A Expired - Fee Related CN102035753B (en) 2009-10-02 2009-10-02 Filter dynamic integration-based method for filtering junk mail

Country Status (1)

Country Link
CN (1) CN102035753B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108512828A (en) * 2018-02-13 2018-09-07 论客科技(广州)有限公司 Mail piece identifiers and filter method, device, server based on address list and system
CN108694202A (en) * 2017-04-10 2018-10-23 上海交通大学 Configurable Spam Filtering System based on sorting algorithm and filter method
CN113938266A (en) * 2021-09-18 2022-01-14 桂林电子科技大学 Junk mail filter training method and system based on integer vector homomorphic encryption

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7519668B2 (en) * 2003-06-20 2009-04-14 Microsoft Corporation Obfuscation of spam filter
CN101106539A (en) * 2007-08-03 2008-01-16 浙江大学 Filtering method for spam based on supporting vector machine
CN101227435A (en) * 2008-01-28 2008-07-23 浙江大学 Method for filtering Chinese junk mail based on Logistic regression

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694202A (en) * 2017-04-10 2018-10-23 上海交通大学 Configurable Spam Filtering System based on sorting algorithm and filter method
CN108512828A (en) * 2018-02-13 2018-09-07 论客科技(广州)有限公司 Mail piece identifiers and filter method, device, server based on address list and system
CN113938266A (en) * 2021-09-18 2022-01-14 桂林电子科技大学 Junk mail filter training method and system based on integer vector homomorphic encryption
CN113938266B (en) * 2021-09-18 2024-03-26 桂林电子科技大学 Junk mail filter training method and system based on integer vector homomorphic encryption

Also Published As

Publication number Publication date
CN102035753B (en) 2012-07-11

Similar Documents

Publication Publication Date Title
CN101184259B (en) Keyword automatically learning and updating method in rubbish short message
CN101257671B (en) Method for real time filtering large scale rubbish SMS based on content
Sriram et al. Short text classification in twitter to improve information filtering
CN101645069B (en) Regular expression storage compacting method in multi-mode matching
CN102255922A (en) Intelligent multilevel junk email filtering method
CN103024746A (en) System and method for processing spam short messages for telecommunication operator
Alzahrani et al. Comparative study of machine learning algorithms for SMS spam detection
CN103279479A (en) Emergent topic detecting method and system facing text streams of micro-blog platform
CN102035753B (en) Filter dynamic integration-based method for filtering junk mail
CN101360074B (en) Method and system determining suspicious spam range
CN106649338B (en) Information filtering strategy generation method and device
Rifat et al. Bert against social engineering attack: Phishing text detection
CN109558486A (en) Electric power customer service client's demand intelligent identification Method
CN106897423A (en) A kind of cloud platform junk data processing method and system
CN107992508B (en) Chinese mail signature extraction method and system based on machine learning
CN105721539A (en) Short message classification apparatus and method based on behavior features
Luo et al. Design and implement a rule-based spam filtering system using neural network
Manjusha et al. Spam mail classification using combined approach of bayesian and neural network
JP4686724B2 (en) E-mail system with spam filter function
Behjat et al. A PSO-Based Feature Subset Selection for Application of Spam/Non-spam Detection
Yin et al. An improved bayesian algorithm for filtering spam e-mail
Charninda et al. Content based hybrid sms spam filtering system
CN103684991A (en) Junk mail filtering method based on mail features and content
Manek et al. ReP-ETD: A Repetitive Preprocessing technique for Embedded Text Detection from images in spam emails
Karishma et al. Spam Detection using Recurrent Neural Networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120711

Termination date: 20131002