CN105871887A

CN105871887A - Client-side based personalized E-mail filtering system and method

Info

Publication number: CN105871887A
Application number: CN201610316436.0A
Authority: CN
Inventors: 谭营; 高扬; 米古月
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2016-05-12
Filing date: 2016-05-12
Publication date: 2016-08-17
Anticipated expiration: 2036-05-12
Also published as: CN105871887B

Abstract

The invention discloses a client-based personalized email filtering system and filtering method. The system includes a receiving module, a filtering and updating module, and a display module. The receiving module receives and preprocesses emails. The filtering and updating module includes a database, a condition matcher, and an intelligent detection classifier: the database is a training data set; the condition matcher lets the user set filter conditions and filters emails according to them; the intelligent detection classifier then detects and classifies the emails, while the received emails are used to update the classifier's training data set in real time, thereby realizing personalized email filtering and classification. The display module displays the results of email filtering and classification. The technical solution provided by the invention offers diversified filtering methods and good performance, and also meets the requirements of real-time operation and personalization.

Description

Client-based Personalized Email Filtering System and Filtering Method

Technical Field

The invention relates to email filtering technology, and in particular to a client-based personalized email filtering system and filtering method.

Background Art

Most current spam filtering methods rely on one of two feature extraction approaches. The first relies on traditional statistics: it analyzes the statistical information of candidate feature words, ranks them by discriminative power, and extracts the words that discriminate well. Although this approach can extract a large number of effective features, the features receive no further processing, so the feature vector dimension is too high and the computational complexity increases.

The second is based on the artificial immune system: drawing on immunological ideas, it simulates the generation of biological antibodies and extracts heuristic features. However, methods of this kind focus on building heuristic rules and make little use of statistical theory to analyze the effectiveness of the extracted features.

Current spam filtering methods are mostly trained on existing data sets, and it is difficult for them to update the data set in real time from the emails received. Most spam filtering used by existing mail clients is performed on the server side, with the client only displaying the classified emails. Server-side filtering must collect usage data from many users before the mail data set can be updated, which results in poor real-time performance. Moreover, because filtering is performed uniformly on the server side, the filtering results are similar or even identical for all users, so users' personalized needs are hard to meet.

Summary of the Invention

To overcome the above deficiencies of the prior art, the present invention provides a client-based personalized email filtering system and filtering method based on immune concentration features. Because different users receive different emails, training and learning are performed on the local client, and the training data set is updated with every email the user receives, thereby realizing personalized email filtering.

The technical solution provided by the invention is as follows:

A client-based personalized email filtering system, including a receiving module, a filtering and updating module, and a display module;

The receiving module receives emails, preprocesses the received emails, and passes the preprocessing results to the filtering module;

The filtering and updating module includes a database, a condition matcher, and an intelligent detection classifier. The database is a training data set stored locally. The condition matcher lets the user set filter conditions and filters received emails according to those conditions. The intelligent detection classifier detects and classifies received emails to obtain their categories, and the received emails are used to update the classifier's training data set in real time. A training data set specific to each user is thus built, so that the classifier used for intelligent detection differs from user to user, thereby realizing personalized email filtering and classification;

The display module displays the results of email filtering and classification.

Specifically, the invention implements the above client system in the Java language, and implements classifier training and classification by calling the function library of the Waikato Environment for Knowledge Analysis (Weka). The filter conditions set by the user include keyword filter conditions, sender address filter conditions, and the like.

The invention also provides a client-based personalized email filtering method, divided into a training phase and a filtering phase. The method is based on immune concentration features and performs training and learning locally. For every email the user receives, the local data set is updated in real time, yielding a personalized training data set for each user that meets different users' personalized filtering requirements and thereby addresses the real-time and personalization problems of email filtering. The method specifically includes the following steps:

1) In the training phase, perform the following steps:

11) For the existing email data set, generate two detector sets according to the information content and tendency of words: a normal mail detector set and a spam detector set;

12) For the existing email data set, use the detector sets built in step 11) to construct immune concentration feature vectors, obtaining the immune concentration feature vector corresponding to each email in the data set;

13) Use the immune concentration feature vectors obtained in step 12) to train a classifier, obtaining a trained classifier model;

2) In the filtering phase, perform the following steps:

21) Preprocess the received emails: parse each received email to obtain its title, body, recipient address, and sender address; filter conditions (including title filter conditions, sender and recipient address filter conditions, and the like) are set on the title, recipient address, and sender address for classifying emails; the body is segmented into words, so that each email is divided into multiple feature words;

22) Classify and filter the received emails by performing the following operations:

221) For each received email, use the detector sets built in step 11) to convert the email into its corresponding immune concentration feature vector;

222) Classify the email with the classifier model of step 13) to obtain a classification result;

223) Filter the received email according to the classification result and the filter conditions set by the user, obtaining the filtering result;

23) Update in real time according to user interaction and display the results, covering the following cases:

23a) When a received email is classified as spam, it goes to the "junk mail box";

23b) When a received email is classified as normal mail, it goes to the "inbox";

23c) When the user finds normal mail in the junk mail box or spam in the inbox, the user can manually reclassify the misclassified email; the reclassified email is segmented into words, and the method returns to step 1), where these words are used to update the detector sets, after which the immune concentration feature vectors are rebuilt and the classifier is retrained in turn.

Further, in the above filtering method, the information content and tendency of the words in step 11) are computed by a word screening method and a tendency calculation method, respectively. The word screening method is specifically:

For the existing email data set, the information gain I(t) of every feature word is computed by Formula 1, all feature words are ranked by their information gain I(t), and the feature words ranked in the top m% are added to the gene pool. In the embodiment of the invention, m is preferably 50.
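Formula 1 itself is not reproduced in this text. Assuming it is the standard information gain used for feature selection in text categorization, which is consistent with the symbols defined immediately after it, it would read as follows, with the sums running over the two categories (normal mail and spam). This is a reconstruction under that assumption, not a quotation of the original formula.

```latex
I(t) = -\sum_{i} P(C_i)\,\log P(C_i)
       + P(t)\sum_{i} P(C_i \mid t)\,\log P(C_i \mid t)
       + P(\bar{t})\sum_{i} P(C_i \mid \bar{t})\,\log P(C_i \mid \bar{t})
```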

In the above formula, P(C_i) denotes the frequency of documents of category C_i in the data set; P(t) denotes the probability that a document in the data set contains feature word t; P(t̄) denotes the probability that a document in the data set does not contain feature word t; P(C_i|t) denotes the probability that a document belongs to category C_i given that feature word t appears in it; and P(C_i|t̄) denotes the probability that a document belongs to category C_i given that feature word t does not appear in it.

The tendency calculation is specifically: for each feature word in the gene pool, compute its frequency of occurrence in spam and in normal mail. When the word occurs more frequently in normal mail than in spam, it is added to the normal mail detector set; when it occurs more frequently in spam than in normal mail, it is added to the spam detector set; when the two frequencies are equal, it is not included in either detector set. Two detector sets are thus generated.
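The tendency rule can be illustrated with a short Python sketch. It is not the patented Java/Weka implementation: the function name `build_detector_sets` is hypothetical, and "frequency" is assumed here to mean the fraction of emails of a class that contain the word, which the text does not specify.

```python
from collections import Counter


def build_detector_sets(gene_pool, spam_docs, ham_docs):
    """Split gene-pool words into (spam_detectors, ham_detectors).

    Each document is a list of feature words. Frequency is taken to be the
    fraction of emails in a class that contain the word (an assumption).
    """
    spam_df = Counter(w for doc in spam_docs for w in set(doc))
    ham_df = Counter(w for doc in ham_docs for w in set(doc))
    n_spam = max(len(spam_docs), 1)
    n_ham = max(len(ham_docs), 1)

    spam_detectors, ham_detectors = set(), set()
    for word in gene_pool:
        f_spam = spam_df[word] / n_spam
        f_ham = ham_df[word] / n_ham
        if f_spam > f_ham:
            spam_detectors.add(word)
        elif f_ham > f_spam:
            ham_detectors.add(word)
        # Equal frequencies: the word joins neither detector set.
    return spam_detectors, ham_detectors
```

A word such as "now" that appears equally often in both classes ends up in neither set, matching the tie rule stated above.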

Further, in the above filtering method, the immune concentration feature vector of step 12) is constructed as follows: for each email in the data set, count how many of its distinct feature words appear in the spam detector set and in the normal mail detector set. Let N denote the number of distinct feature words in the email, S the number of those feature words that appear in the spam detector set, and L the number that appear in the normal mail detector set. The two-dimensional vector (S/N, L/N) is taken as the immune concentration feature vector, which yields the immune concentration feature vector for each email in the data set.
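As a minimal sketch of this construction, the following Python function computes (S/N, L/N) for one email. The name `concentration_vector` is hypothetical, and returning (0.0, 0.0) for an email with no feature words is a convention assumed here, since the text does not cover that case.

```python
def concentration_vector(words, spam_detectors, ham_detectors):
    """Return (S/N, L/N) for one email given as a list of feature words."""
    distinct = set(words)
    n = len(distinct)  # N: number of distinct feature words
    if n == 0:
        # Empty body: no evidence either way (assumed convention).
        return (0.0, 0.0)
    s = len(distinct & spam_detectors)  # S: distinct words in the spam set
    l = len(distinct & ham_detectors)   # L: distinct words in the normal set
    return (s / n, l / n)
```

For example, an email with the words free, free, win, now, hello has N = 4; with spam detectors {free, win} and normal mail detectors {now}, the vector is (0.5, 0.25).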

Further, in the above filtering method, the classifier is a support vector machine (SVM).

Further, in the above filtering method, during the training of step 13), a quadratic programming method is used to optimize the parameters of the classifier.

Compared with the prior art, the beneficial effects of the invention are:

Most existing spam filtering methods are trained on existing data sets, and those data sets are rarely updated in real time from the emails received. This is because they filter on the server side, and the data set on the server can only be updated after usage data has been collected from many users.

By constructing an immune concentration feature vector for each email, the spam filtering method provided by the invention extracts email features effectively, which improves classification performance and the spam filtering result. Building on the good filtering performance of the immune concentration feature method, and given that every user receives different emails, a personalized detection classifier is built locally for each user, realizing a personalized spam filtering client. The client system also includes other rule-based filtering methods, such as whitelists and keywords, which diversify the filtering methods and improve the overall performance of the system. The client provided by the invention performs training and learning locally, and the training data set is updated with every email the user receives. Because different users receive different emails, this real-time update of the local data set gives each user a different, personalized training data set, achieving personalized spam filtering for different users and addressing the real-time and personalization problems.

In summary, the technical solution provided by the invention on the one hand offers diversified filtering methods and good performance (indicators such as accuracy, recall, and F-measure for spam filtering can exceed 98%), and on the other hand meets the requirements of real-time operation and personalization.

Brief Description of the Drawings

Fig. 1 is a flow chart of the filtering method based on immune concentration features provided by the invention.

Fig. 2 is a structural block diagram of the immune-concentration-based spam filtering client system implemented in an embodiment of the invention.

Fig. 3 is a screenshot of the main interface of the client system after login in an embodiment of the invention.

Fig. 4 is a screenshot of the mail reading interface of the client system in an embodiment of the invention.

Fig. 5 is a screenshot of the filtering function settings interface of the client system in an embodiment of the invention.

Detailed Description

The invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.

The invention provides a spam filtering method based on immune concentration features, proposes a new immune concentration feature extraction method, and applies it to an email client system. The system supports simultaneous login of multiple accounts, reads user emails from the mail server, extracts the concentration features of each email, and uses a classifier to produce the corresponding classification result. The spam filtering method based on immune concentration features is divided into a training phase and a filtering phase. In the training phase, the training data set is fed to the classifier, whose parameters are learned and optimized to obtain the best-performing classifier. In the filtering phase, the trained classifier is applied to the emails received by the client. The specific steps include:

S1) Take an existing email collection as the data set, extract immune concentration feature vectors from it, and feed them to the classifier for training and learning to generate a classifier model; the embodiment of the invention uses an SVM as the classifier;

S2) After each user receives emails, parse each user's emails separately to obtain the title, body, and recipient and sender addresses of each email;

S3) Segment the email body into words, generate the immune concentration feature vector from the segmented body and the detector sets, and classify the email with the classifier model generated in S1.

The specific implementation flow of the filtering method based on immune concentration features is shown in Fig. 1. Each received email is parsed to obtain its title, sender address, and body. The title, sender address, and similar parts are filtered by matching them against the filter conditions set by the user, including keyword filtering and sender address filtering; the body is segmented into words, its immune concentration features are constructed, and the classification result is computed. Finally, the user's filter conditions and the classifier's result are combined to filter the emails in the client system uniformly.

Following the filtering method provided by the invention, the embodiment below builds an immune-concentration-based spam filtering client system. The system is implemented in the Java language and calls the Weka function library for classifier training and classification. Fig. 2 is a structural block diagram of the system, which mainly comprises three modules: a receiving module, a filtering module, and a display module. The receiving module preprocesses received emails and passes the preprocessing results to the filtering module. The filtering module filters the user's emails through filter conditions and the intelligent detection and classification method, and updates the classifier in real time to achieve personalized classification. The display module displays the filtering results, with spam going to the junk mail box. The system is implemented in the following steps:

Step 1: Build the detector sets

A detector set is a collection of detectors. The invention uses two: a spam detector set and a normal mail detector set. The tendency of each feature word toward the two classes of email is computed; feature words that tend to appear in spam are assigned to the spam detector set, and feature words that tend to appear in normal mail are assigned to the normal mail detector set.

In the generation of the detector sets, the main work is to combine the word screening algorithm with the tendency function, generating the two detector sets according to the information content of the feature words (this embodiment uses information gain as the measure of information content, computed as described below) and their tendency. Specifically:

11) Word screening: for the existing email data set, the email bodies are segmented to obtain the feature words. In this embodiment, each Chinese character is treated as one feature word and each word as one feature word; for example, "城市" (city) is split into the two feature words "城" and "市". After segmentation, each email is divided into N feature words. The information gain I(t) of every feature word is computed by Formula 1, and all feature words are ranked by their information gain I(t). The feature words ranked in the top m% by information gain are added to the gene pool; experiments show that m = 50 works best;

In the above formula, C_i denotes an email category (normal mail or spam); P(C_i) denotes the frequency of documents of category C_i in the data set; P(t) denotes the probability that a document in the data set contains feature word t; P(t̄) denotes the probability that a document in the data set does not contain feature word t; P(C_i|t) denotes the probability that a document belongs to category C_i given that feature word t appears in it; and P(C_i|t̄) denotes the probability that a document belongs to category C_i given that feature word t does not appear in it.
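The word screening step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the embodiment's code: the names `tokenize`, `information_gain`, and `build_gene_pool` are hypothetical, the CJK range and lowercasing in the tokenizer are choices made here, and the gain is computed with the standard information-gain expression (natural logarithm) on the assumption that it matches Formula 1.

```python
import math
import re

# One CJK character, or one run of ASCII letters/digits, per feature word.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def tokenize(text):
    """Split text so each Chinese character and each word is a feature word."""
    return [tok.lower() for tok in TOKEN.findall(text)]


def _plogp(p):
    return p * math.log(p) if p > 0 else 0.0


def information_gain(term, docs, labels):
    """Information gain of `term`; docs are sets of words, labels a list."""
    n = len(docs)
    classes = sorted(set(labels))
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]
    # Class entropy term: -sum P(C_i) log P(C_i).
    gain = -sum(_plogp(labels.count(c) / n) for c in classes)
    # Conditional terms for "term present" and "term absent".
    for subset in (with_t, without_t):
        if subset:
            weight = len(subset) / n
            gain += weight * sum(_plogp(subset.count(c) / len(subset)) for c in classes)
    return gain


def build_gene_pool(docs, labels, m=50):
    """Keep the top m percent of feature words ranked by information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(t, docs, labels), reverse=True)
    keep = max(1, len(ranked) * m // 100)
    return ranked[:keep]
```

A word that appears only in one class receives the maximum gain, while a word spread evenly across both classes receives a gain of zero and is the first to be dropped from the gene pool.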

12) Tendency calculation: for each feature word in the gene pool, compute its frequency of occurrence in each class of email (spam and normal mail in this embodiment).

Feature words that occur more frequently in spam are added to the spam detector set DS_S, and feature words that occur more frequently in normal mail are added to the normal mail detector set DS_L.

Step 2: Construct the immune concentration feature vectors

For the existing email data set, count how many of the distinct feature words of each email appear in the spam detector set DS_S and in the normal mail detector set DS_L. Let N denote the number of distinct feature words in the email, S the number of those feature words that appear in the spam detector set, and L the number that appear in the normal mail detector set. The constructed immune concentration feature vector is then the two-dimensional vector (S/N, L/N).

Step 3: Train the classifier

The previous step converted each email into its corresponding immune concentration feature vector; these feature vectors are used to train the classifier. The classifier in this embodiment is a support vector machine (SVM). During training, a quadratic programming method is used to optimize the parameters of the classifier model.
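The embodiment trains its SVM through Weka with quadratic programming; neither is reproduced here. As a dependency-free stand-in, the Python sketch below fits a linear SVM on two-dimensional concentration vectors by subgradient descent on the regularized hinge loss, which is a different optimization technique from the one the text names. The function names and the hyperparameters `lam`, `epochs`, and `lr` are assumptions chosen for illustration.

```python
def train_linear_svm(vectors, labels, lam=0.01, epochs=200, lr=0.1):
    """Fit a linear SVM on 2-D vectors by subgradient descent on hinge loss.

    labels: +1 for spam, -1 for normal mail.
    Returns (w, b); the predicted label is the sign of w . x + b.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            # L2 regularization shrinks w slightly on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Sample lies inside the margin: push the boundary away.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(model, x):
    """Return +1 (spam) or -1 (normal mail) for one concentration vector."""
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1
```

Because the feature vector has only two components, a linear boundary is cheap to train and retrain on the client each time the detector sets change.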

Step 4: Preprocessing of received emails by the client system

Fig. 4 is a screenshot of the mail reading interface of the client system in the embodiment. After the client system receives an email, it parses the email to obtain its title, body, and recipient and sender addresses. The title and the sender and recipient addresses are matched against the filter conditions set by the user for condition-based filtering; the body is segmented into words and then passed to the classifier model trained in the previous step;
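The parsing step can be illustrated with Python's standard `email` package; the embodiment itself is written in Java, so this is only a sketch of the same idea. The function name `parse_email` and the returned dictionary keys are hypothetical, and the sketch does not decode encoded headers or transfer encodings.

```python
from email import message_from_string
from email.utils import parseaddr


def parse_email(raw):
    """Return the title, sender, recipient and plain-text body of a message."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Take the first text/plain part, if any.
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        body = parts[0].get_payload() if parts else ""
    else:
        body = msg.get_payload()
    return {
        "subject": msg.get("Subject", ""),
        "sender": parseaddr(msg.get("From", ""))[1],
        "recipient": parseaddr(msg.get("To", ""))[1],
        "body": body,
    }
```

The title and addresses go to the condition matcher, while the body is what gets segmented into feature words.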

Step 5: Classification and filtering of emails by the client system

In the previous step, every email in the client system was divided into multiple feature words. The filtering function of the client system is then turned on, as shown in Fig. 5. The detector sets built in Step 1 are used again to convert each email into its immune concentration feature vector by the method of Step 2, and the classifier model trained in Step 3 classifies the email. Finally, the emails in the client system are filtered according to the classification result and the filter conditions the user has set on the title, sender and recipient addresses, and so on (for example, whether the sender address is on the blacklist, or whether the title contains certain keywords). The result is displayed in the client as shown in Fig. 3.
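The text does not say how rule matches and the classifier verdict are prioritized when they disagree. The Python sketch below shows one possible combination, with an assumed precedence: whitelist first, then blacklist and title keywords, then the classifier. The function name `filter_email` and all of its parameters are hypothetical.

```python
def filter_email(sender, subject, classifier_says_spam,
                 blacklist=(), whitelist=(), keywords=()):
    """Return "spam" or "inbox" for one email.

    The precedence is an assumption of this sketch: whitelist first, then
    blacklist and title keywords, then the classifier's verdict.
    """
    sender = sender.lower()
    if sender in {a.lower() for a in whitelist}:
        return "inbox"
    if sender in {a.lower() for a in blacklist}:
        return "spam"
    lowered = subject.lower()
    if any(k.lower() in lowered for k in keywords):
        return "spam"
    return "spam" if classifier_says_spam else "inbox"
```

Putting user-set rules ahead of the classifier keeps the user's explicit choices authoritative, which fits the personalization goal, but other orderings are equally compatible with the description.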

Step 6: Real-time update according to user interaction

The filtering results of the previous step are displayed: emails classified as spam go to the "junk mail box" and normal emails go to the "inbox". When the user finds normal mail in the junk mail box or spam in the inbox, the user can manually reclassify the misclassified emails. These emails are segmented into words, and the process jumps back to Step 1: the words are used to update the detector sets, after which the immune concentration feature vectors are rebuilt and the classifier is retrained in turn. The detector sets are updated as follows. For an email manually marked as spam by the user, all of its feature words that do not belong to the normal mail detector set are added to the spam detector set. Likewise, for an email manually marked as normal mail, all of its feature words that do not belong to the spam detector set are added to the normal mail detector set.
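The detector set update rule can be written as a short Python sketch. The function name `update_detectors` and the label strings "spam" and "ham" are hypothetical; the sets are modified in place.

```python
def update_detectors(words, user_label, spam_detectors, ham_detectors):
    """Update both detector sets in place from one user-relabelled email."""
    distinct = set(words)
    if user_label == "spam":
        # Add every word that is not already a normal mail detector.
        spam_detectors |= distinct - ham_detectors
    elif user_label == "ham":
        # Add every word that is not already a spam detector.
        ham_detectors |= distinct - spam_detectors
    else:
        raise ValueError("user_label must be 'spam' or 'ham'")
    return spam_detectors, ham_detectors
```

After such an update, the concentration vectors are recomputed and the classifier is retrained, as the step describes.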

It should be noted that the embodiments are disclosed to aid understanding of the invention, and those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to what the embodiments disclose; the scope of protection is defined by the claims.

Claims

1. A client-based personalized email filtering system, characterized in that it includes a receiving module, a filtering and updating module, and a display module;

The receiving module is used to receive emails, then preprocess the received emails, and pass the preprocessing results to the filtering module;

The filtering and updating module comprises a database, a condition matcher, and an intelligent detection classifier; the database is a training data set stored locally; the condition matcher is used by the user to set filter conditions and filters the received emails according to the filter conditions; the intelligent detection classifier then detects and classifies the received emails to obtain their categories; at the same time, the received emails are used to update the local training data set of the intelligent detection classifier in real time, thereby realizing personalized email filtering and classification;

The display module displays the results of email filtering and classification.

2. The personalized email filtering system as claimed in claim 1, characterized in that the system is implemented in the Java programming language, and the intelligent detection classifier is implemented by calling the Weka function library.

3. The personalized email filtering system according to claim 1, wherein the filter conditions set by the user include keyword filter conditions and sender address filter conditions.

4. A client-based personalized email filtering method, comprising a training phase and a filtering phase; in the training phase, the training data set is input into a classifier, and the parameters of the classifier are learned and optimized to obtain an optimal classifier; in the filtering phase, the trained optimal classifier is applied to the mail received in the client; the mail filtering method is based on immune concentration features, and obtains a personalized training data set for each user through real-time updating of the client's local data set, so as to meet the personalized spam filtering requirements of different users; specifically, the method includes the following steps:

1) In the training phase, perform the following steps:

11) For the existing email data set, generate a detector set according to the information content and the tendency of the segmented words; the detector set includes a normal mail detector set and a spam detector set;

12) For the existing email data set, use the detector set constructed in step 11) to construct immune concentration feature vectors, obtaining the immune concentration feature vector corresponding to each mail in the email data set;

13) Use the immune concentration feature vector of each mail obtained in step 12) to train the classifier, obtaining the trained classifier model;

2) In the filtering stage, perform the following steps:

21) Preprocess the received emails: parse each received email to obtain its title, body text, recipient address, and sender address; the title, recipient address, and sender address are matched against the filter conditions set for mail classification; the body text is segmented into words, so that each mail is divided into multiple feature words;

22) Classify and filter the received mail, perform the following operations:

221) For each received email, use the detector set constructed in step 11) to convert the email into its corresponding immune concentration feature vector;

222) Use the classifier model of step 13) to classify the mail as spam or normal mail, thereby obtaining the classification result;

223) Filter the received mail according to the classification result and the filter conditions set by the user to obtain a filtering result, namely that the received mail is classified as spam or normal mail;

23) Real-time update and display according to user interaction operation.
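The filtering phase of steps 21) to 223) can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions rather than the claimed implementation: the function names, the whitespace tokenizer standing in for word segmentation, the `classify` callable standing in for the trained classifier, and the rule that any matching user condition marks the mail as spam are all hypothetical choices made for the example. Only non-multipart messages are handled.

```python
from email import message_from_string


def parse_email(raw):
    """Split a raw message into the fields named in step 21)."""
    msg = message_from_string(raw)
    return {
        "title": msg.get("Subject", ""),
        "sender": msg.get("From", ""),
        "recipient": msg.get("To", ""),
        "body": msg.get_payload(),  # a plain string for non-multipart mail
    }


def matches_conditions(mail, blocked_senders, blocked_keywords):
    """Return True if a sender-address or title-keyword condition matches."""
    sender = mail["sender"].lower()
    title = mail["title"].lower()
    if any(addr.lower() in sender for addr in blocked_senders):
        return True
    return any(word.lower() in title for word in blocked_keywords)


def filter_mail(raw, classify, blocked_senders=(), blocked_keywords=()):
    """Combine the user-set conditions with the classifier result."""
    mail = parse_email(raw)
    # Assumption: a matching user condition overrides the classifier.
    if matches_conditions(mail, blocked_senders, blocked_keywords):
        return "spam"
    # Assumption: whitespace splitting stands in for real word segmentation.
    words = mail["body"].lower().split()
    return classify(words)


raw = (
    "From: promo@example.com\n"
    "To: me@example.org\n"
    "Subject: Hello\n"
    "\n"
    "quarterly report attached\n"
)


def always_normal(words):
    """Stand-in classifier that labels every mail as normal."""
    return "normal"


result = filter_mail(raw, always_normal, blocked_senders=["promo@example.com"])
```

Here `result` is the label assigned to the sample message; with the sender blocked, the condition matcher decides the outcome before the stand-in classifier is consulted.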

5. The mail filtering method as claimed in claim 4, characterized in that the real-time update and display according to user interaction in step 23) comprises the following situations:

23a) When a received email is classified as spam, said email enters the "junk mail box";

23b) When the received mail is classified as normal mail, said mail enters the "inbox";

23c) When the user finds normal mail in the "junk mail box" or spam in the "inbox", the user can manually reclassify the misclassified mail; the reclassified mail is segmented into words, and the method returns to the training phase of step 1), where the segmented words are used to update the detector set, after which the immune concentration feature vectors are rebuilt and the classifier is retrained in turn.

6. The mail filtering method as claimed in claim 4, characterized in that step 11) obtains the detector set according to the information content and the tendency of the segmented words, which are calculated by a word screening method and a tendency calculation method, respectively;

The word screening method is specifically:

For the existing email data set, the information gain IG(t) of every feature word is calculated by Formula 1; all feature words are sorted by information gain IG(t), and the top m% of the sorted feature words are added to the gene pool;

$$IG(t) = \sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t})$$

(Formula 1)

In the above formula, $P(c_i)$ represents the frequency of documents of category $c_i$ in the data set; $P(t)$ represents the probability of documents containing feature word t in the data set; $P(\bar{t})$ represents the probability of documents not containing feature word t in the data set; $P(c_i \mid t)$ represents the probability that a document belongs to category $c_i$ given that feature word t appears; $P(c_i \mid \bar{t})$ represents the probability that a document belongs to category $c_i$ given that feature word t does not appear;

The tendency is calculated as follows: for each feature word in the gene pool, calculate its frequency of occurrence in spam and in normal mail; when the frequency of the feature word in normal mail is greater than its frequency in spam, the feature word is recorded in the normal mail detector set; when the frequency of the feature word in spam is greater than its frequency in normal mail, the feature word is recorded in the spam detector set; two types of detector sets are thus generated.
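The word screening and tendency calculation of this claim can be sketched as follows. Several points are assumptions made for the example rather than statements of the claimed method: frequencies are taken as document frequencies (the fraction of mails of a class containing the word), words with equal frequency in both classes are left out of both sets, and the gain is computed in the standard form with a leading minus sign on the class-entropy term. That term is the same for every feature word, so its sign does not change the ranking. All function names are hypothetical.

```python
import math
from collections import Counter


def plogp(p):
    """Return p * log(p), with the usual convention that 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def information_gain(word, docs, classes):
    """Information gain of `word` over docs given as (set_of_words, label) pairs."""
    n = len(docs)
    with_t = [label for words, label in docs if word in words]
    without_t = [label for words, label in docs if word not in words]
    # Class-entropy term (standard form, with a leading minus sign).
    gain = -sum(plogp(sum(1 for _, l in docs if l == c) / n) for c in classes)
    # Conditional terms for documents with and without the word.
    for subset in (with_t, without_t):
        if subset:
            counts = Counter(subset)
            weight = len(subset) / n
            gain += weight * sum(plogp(counts[c] / len(subset)) for c in classes)
    return gain


def build_detector_sets(docs, m_percent=50):
    """Keep the top m% of words by information gain, then split by tendency."""
    classes = ("spam", "normal")
    vocab = sorted(set().union(*(words for words, _ in docs)))
    ranked = sorted(vocab, key=lambda w: information_gain(w, docs, classes),
                    reverse=True)
    gene_pool = ranked[:max(1, math.ceil(len(ranked) * m_percent / 100))]

    spam_docs = [words for words, label in docs if label == "spam"]
    normal_docs = [words for words, label in docs if label == "normal"]
    spam_set, normal_set = set(), set()
    for w in gene_pool:
        f_spam = sum(w in d for d in spam_docs) / max(1, len(spam_docs))
        f_normal = sum(w in d for d in normal_docs) / max(1, len(normal_docs))
        if f_spam > f_normal:
            spam_set.add(w)
        elif f_normal > f_spam:
            normal_set.add(w)
    return spam_set, normal_set


docs = [
    ({"win", "prize", "now"}, "spam"),
    ({"win", "cash", "now"}, "spam"),
    ({"meeting", "report", "now"}, "normal"),
    ({"meeting", "agenda", "now"}, "normal"),
]
spam_set, normal_set = build_detector_sets(docs, m_percent=100)
```

In this toy data set the word "now" appears in every mail, so its information gain is zero and its tendency is tied; under the tie rule assumed here it enters neither detector set.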

7. The mail filtering method according to claim 6, wherein the value of m is 50.

8. The mail filtering method as claimed in claim 4, characterized in that the specific method of constructing the immune concentration feature vector in step 12) is:

For each email in the email data set, count the number of distinct feature words that appear in the spam detector set and in the normal mail detector set;

Let N denote the number of distinct feature words in each email, S the number of those feature words that appear in the spam detector set, and L the number that appear in the normal mail detector set; construct the two-dimensional vector (S/N, L/N) as the immune concentration feature vector, thereby obtaining the immune concentration feature vector corresponding to each email in the email data set.
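The construction of the vector (S/N, L/N) can be sketched as follows. The function name `concentration_vector` and the choice to return (0.0, 0.0) for an email with no feature words are assumptions made for the example; the claim does not address the empty case.

```python
def concentration_vector(words, spam_detectors, normal_detectors):
    """Return the immune concentration feature vector (S/N, L/N) of one email."""
    distinct = set(words)          # N counts distinct feature words only
    n = len(distinct)
    if n == 0:
        return (0.0, 0.0)          # assumed handling of an empty email
    s = len(distinct & spam_detectors)    # words matched by spam detectors
    l = len(distinct & normal_detectors)  # words matched by normal mail detectors
    return (s / n, l / n)


spam_detectors = {"win", "cash", "prize"}
normal_detectors = {"meeting", "agenda"}

# Four distinct words: two hit the spam set, one hits the normal mail set.
vector = concentration_vector(
    ["win", "cash", "now", "meeting", "win"], spam_detectors, normal_detectors
)
```

For this email N = 4, S = 2, and L = 1, so the vector is (0.5, 0.25). Each email is thus reduced to two numbers regardless of its length, which keeps the classifier input low-dimensional.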

9. The mail filtering method according to claim 4, wherein the classifier adopts a support vector machine (SVM).

10. The mail filtering method according to claim 4, characterized in that, in the training process of step 13), the parameter optimization of the classifier is performed using a quadratic programming method.