CN110753024A

CN110753024A - Personalized mail re-filtering method in collective environment

Info

Publication number: CN110753024A
Application number: CN201810822625.4A
Authority: CN
Inventors: 陈松灿; 徐丹丹
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2018-07-23
Filing date: 2018-07-23
Publication date: 2020-02-04

Abstract

Because users differ in interests and habits, their definitions of spam differ greatly, so personalized spam filtering has become an important research topic in mail filtering. Under full personalization, however, the number of labeled mails available for a specific user is limited, and a personalized filter also suffers from label delay. Moreover, the mails received by users in the same group environment (school, college, or company) are correlated, so a fully personalized mail filter, which ignores this correlation, learns from limited information. When a mail is filtered incorrectly, the user has to correct it manually, which degrades the user experience. To address these problems, the invention provides a personalized mail re-filtering method in a collective environment that realizes functions such as personalized mail filtering and automatic correction of wrongly filtered mails.

Description

Personalized mail re-filtering method in collective environment

Technical Field

The invention relates to the field of information filtering, in particular to a personalized mail re-filtering method in a collective environment, and is mainly applied in the technical field of data mining to realize mail filtering.

Background

Mail is one of the most popular communication tools and brings convenience to life and work, but the large volume of spam seriously reduces working efficiency; in particular, when a mail is filtered by mistake, the user must manually restore it to normal. Spam filtering has therefore become an essential part of a mail service system. Spam filtering technology decides, according to known spam characteristics, whether the current mail is normal (labeled 0) or spam (labeled 1). The general filter workflow is shown in FIG. 1. Spam filtering can be regarded as a text-oriented binary classification problem, but it differs from general text classification in that it is highly personalized: different users may classify the same mail quite differently, and a globally uniform binary filtering standard cannot match every user's subjective judgment. In a collective environment, however, the mails received by users are strongly correlated and interdependent, which requires filters that balance individual characteristics against collective characteristics. Furthermore, mail is an online application: as network culture keeps changing, both the characteristics of spam and the interests of users change, forming a dynamic environment. A traditional spam filter learns from a large corpus and then predicts the classes of unlabeled mails, under the assumption that the training set and the test set follow the same distribution. In a dynamic environment this assumption does not hold, which poses a great challenge to researchers.

Mail filters fall into two types according to their filtering scope: a single generalized filter for all users, and a personalized filter for a particular user. The former is usually deployed on the server side and filters the mails of all users. It learns a globally uniform concept of spam, so it cannot accurately reflect the interests of individual users and misjudges many mails. Personalization has therefore become a primary task in the field of mail filtering. A personalized filter is deployed at the client and filters only the mails of an individual user: it analyzes the user's current interests from the user's feedback and filters mails accordingly, which alleviates the serious misjudgment of the generalized filter. In recent years, scholars at home and abroad have proposed various mail filtering methods.

Han et al. proposed the Relaxed Online Support Vector Machine (ROSVM) model, which adopts Online SVM, a typical online learning method, as the filter for identifying mail categories and relaxes its constraints to speed up filter training significantly at low cost. Subsequently, Sun et al. proposed, on the basis of ROSVM, an active learning method based on misjudgment and low certainty (MLC), which selects misjudged mails and mails with uncertain prediction results as the training set, thereby reducing the training cost.

Recently, to overcome the drop in filter performance caused by the continuously changing content of spam and by users' individual judgments of mail classes, Sanghani et al. proposed a new personalized filtering method based on an incremental SVM. Before the incremental SVM is applied, the attribute set is updated heuristically so that the classification model effectively learns the changed data distribution: feature selection with Information Gain (IG) is performed on the retraining samples to generate a new attribute set, which replaces the attributes with low IG values in the original attribute set. Although the above two methods alleviate mail misjudgment to some extent, they do not consider the individual user's subjective judgment of mail classes, which also changes over time. In practical application scenarios, a generalized filter needs to be robust and a personalized filter needs to be scalable. To address these challenges, Junejo et al. designed robust personalized filters based on local and global discriminant models. Their method uses labeled training samples to build multidimensional spaces of spam keywords and normal-mail keywords, projects these spaces onto a two-dimensional space to obtain a local model, and obtains the parameters of the global discriminant model by minimizing the filtering error on the training set. The model can serve as a generalized filter, or it can be updated with unlabeled samples from a specific user's inbox to serve as a personalized filter, so it suits the situation in which the joint distribution of mails and labels varies across users and over time.
Although these two methods use the personalized features of the user to improve filtering accuracy, they do not consider, as happens in practice, the wrongly filtered mails in a specific user's different mailboxes (inbox and junk box), which gives rise to class prior differences and class imbalance. Meanwhile, under full personalization, the number of labeled mails of a specific user is limited, and the personalized filter suffers from label delay. To solve these problems, the invention proposes a personalized mail filtering method for a collective environment.

Disclosure of Invention

[ OBJECTS OF THE INVENTION ]

Existing mainstream mail filtering systems still filter mails wrongly, for example normal mails stored in the junk box and spam delivered to the inbox, so mail misjudgment remains an open problem in the field of mail filtering. Its main causes can be summarized as follows. First, spammers constantly change the content characteristics of spam to evade filters, so the data distribution changes over time. Second, whether a received mail is spam depends on the user's interests at the current stage: a user's interests may change from one period to another, and mails of the same type may be labeled differently according to subjective factors. These two cases correspond to concept drift. Finally, a specific user's judgment of mail classes is usually subjective, so a globally uniform, user-independent filtering standard that disagrees with the user's subjective definition causes filtering errors; combining the collective and the individual definitions of spam can effectively reduce the probability of such errors.

We formalize the concept drift problem in mail data streams as follows:

(1) Within the same mailbox (inbox or junk box), the distribution of the data stream changes between different moments:

$P_{t_1}(x, y) \neq P_{t_2}(x, y)$

where $P(\cdot)$ denotes the data distribution, $x$ the feature representation of a mail, $y$ the mail class, and $t_1$, $t_2$ different moments. Because spammers keep changing mail content to evade filter detection, the feature distribution differs from one moment to another.

(2) At the same moment, the data stream distributions of different mailboxes differ:

$P_i(x, y) \neq P_g(x, y)$

where $P_i(\cdot)$ denotes the inbox data distribution and $P_g(\cdot)$ the junk box data distribution. Generally, a user's inbox contains more normal mails than spam, and likewise the junk box contains more spam than normal mails. In severe cases this becomes a class imbalance problem: $P_i(y=0 \mid x) \gg P_g(y=0 \mid x)$ and $P_i(y=1 \mid x) \ll P_g(y=1 \mid x)$. We refer to these two cases as "generalized virtual drift" in mail.
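As a minimal illustration of case (2), the following Python sketch compares the empirical class priors of two labeled mailbox streams. It is only a sketch under stated assumptions: the function names `class_priors` and `normal_prior_gap` are hypothetical, and it measures the marginal priors P(y) rather than the conditionals P(y | x) used in the formalization.

```python
from collections import Counter

NORMAL, SPAM = 0, 1  # label convention from the Background section


def class_priors(labels):
    """Empirical P(y = 0) and P(y = 1) for one mailbox's label stream."""
    counts = Counter(labels)
    total = len(labels)
    return {y: counts[y] / total for y in (NORMAL, SPAM)}


def normal_prior_gap(inbox_labels, junkbox_labels):
    """How much more likely a normal mail is in the inbox than in the junk box."""
    return class_priors(inbox_labels)[NORMAL] - class_priors(junkbox_labels)[NORMAL]


# Toy streams: the inbox is mostly normal mail, the junk box mostly spam.
inbox = [NORMAL] * 9 + [SPAM]
junkbox = [SPAM] * 9 + [NORMAL]
print(class_priors(inbox), class_priors(junkbox))
```

A large gap between the two priors is the situation the text calls class imbalance between mailboxes.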

Unlike existing spam filters, users in the same group share only spam, which increases the diversity of spam samples while protecting user privacy and improves the accuracy with which the personalized filter predicts spam. To solve the three problems above and realize functions such as personalized mail filtering and automatic correction of wrongly filtered mails, the invention combines rules with statistical methods and provides a client-based personalized mail re-filtering system in a collective environment (Co-PRFC). Most existing spam filters only filter the mail data stream online and ignore the class prior differences and class imbalance between mailboxes. The proposed filtering system first processes the mails entering the inbox and the junk box separately, then designs, based on the multi-task learning principle, two filters that learn from each other to filter the inbox mails and the junk box mails respectively, and automatically corrects wrongly filtered mails. In addition, to maintain filter performance while user interests and the mail data distribution change over time, a multi-window learning framework combined with importance weighting is designed, which makes the filter dynamically self-adaptive.

[ TECHNICAL SOLUTION ]

One scenario assumed by the invention is that, to protect each network user's privacy, all users of the same group may voluntarily share their own spam so that other users can use this public information, which increases the diversity of spam available for personalized filtering.

The invention comprises the following contents:

The number of users in the same group is fixed. Whenever a spam mail is shared, it is first checked against the mails already in the collective junk box: if it duplicates one of them, the reporting rate of that mail is updated; otherwise, the mail is added to the collective junk box.

Mails with a higher reporting rate are successively placed into the private junk box of a specific user, and Co-PRFC decides, according to the user's interests, whether each mail is spam. If so, the mail is placed in the junk box; otherwise, it is placed in the inbox.
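The bookkeeping of the collective junk box described above can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the class name `CollectiveJunkBox`, the use of a mail fingerprint string as the duplicate key, and counting each reporting user only once are assumptions. The group size M = 150 and the import threshold 1/3 are the values given in the Detailed Description.

```python
from dataclasses import dataclass, field


@dataclass
class CollectiveJunkBox:
    group_size: int = 150              # M, fixed number of users in the group
    import_threshold: float = 1 / 3    # rate above which a mail is offered to users
    reporters: dict = field(default_factory=dict)  # fingerprint -> set of user ids

    def share(self, user_id, fingerprint):
        """Record one spam report; a duplicate mail only updates its reporting rate."""
        self.reporters.setdefault(fingerprint, set()).add(user_id)
        return self.reporting_rate(fingerprint)

    def reporting_rate(self, fingerprint):
        """Fraction of the group that reported this mail (1/M after the first report)."""
        return len(self.reporters.get(fingerprint, ())) / self.group_size

    def candidates_for(self, already_imported):
        """Mails above the threshold that this user has not imported yet."""
        return [f for f in self.reporters
                if self.reporting_rate(f) > self.import_threshold
                and f not in already_imported]


box = CollectiveJunkBox(group_size=6)
for user in ("u1", "u2", "u3"):
    box.share(user, "mail-a")
box.share("u1", "mail-b")
print(box.candidates_for(set()))
```

The candidates returned here would then be passed to the user's personalized filter, which decides whether each one is spam for that particular user.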

The invention mainly targets wrongly filtered mails, so two filters (Filter_junkbox and Filter_inbox) filter the junk box data stream and the inbox data stream respectively. Filtering the two streams separately, however, gives rise to case (2) of generalized virtual drift.

A specific user's interests also change over time, so the invention designs a multi-window learning framework with two windows of truly labeled samples, the long window LW and the short window SW, and one window without true labels, the target window TW. Whether the user's interests have changed is detected from the prediction accuracy of the sub-models L and S, which correspond to the long window and the short window respectively; if the interests have changed, model L is reset with S. LW holds all samples collected since the last model update, whereas SW holds a fixed number of the most recent samples. Hence, when the error rate of L is lower than that of S, the user's current interests are stable; otherwise, the user's interests have changed recently.
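The comparison between the two sub-models can be sketched in Python as below. The sketch makes several assumptions that the text does not state: error rates are measured over a sliding window of recently labeled mails, a tie between L and S is treated as "stable" so that two equally good models do not trigger a reset, and the names `InterestDriftMonitor` and `select_long_model` are hypothetical.

```python
import copy
from collections import deque


class InterestDriftMonitor:
    """Tracks recent mistakes of the long-window model L and the short-window model S."""

    def __init__(self, window_size=50):
        self.l_mistakes = deque(maxlen=window_size)
        self.s_mistakes = deque(maxlen=window_size)

    def record(self, true_label, l_prediction, s_prediction):
        """Store whether each model was wrong on one newly labeled mail."""
        self.l_mistakes.append(int(l_prediction != true_label))
        self.s_mistakes.append(int(s_prediction != true_label))

    def error_rates(self):
        n = len(self.l_mistakes)
        if n == 0:
            return 0.0, 0.0
        return sum(self.l_mistakes) / n, sum(self.s_mistakes) / n

    def interest_changed(self):
        """True when L errs more often than S (assumption: a tie counts as stable)."""
        l_err, s_err = self.error_rates()
        return l_err > s_err


def select_long_model(model_l, model_s, monitor):
    """Reset L with a copy of S when the user's interests appear to have changed."""
    return copy.deepcopy(model_s) if monitor.interest_changed() else model_l
```

In use, `record` would be called each time the user confirms or corrects a label, and `select_long_model` would be called before the next prediction.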

The invention uses the kernel density ratio $w(x) = P_{TW}(x) / P_{SW}(x)$ to detect whether the distribution of mail content has changed. If it has, model S is re-learned to adapt to the new data distribution, which improves the accuracy of the filter; otherwise, S is left unchanged. To avoid computing the data distribution $P_{TW}(x)$ explicitly, the density ratio is estimated with kernel functions:

$\hat{w}(x) = \sum_{l=1}^{N_m} \alpha_l \varphi_l(x)$

where $N_m$ is the size of the windows TW and SW, $\alpha_l$ are the parameters of the model, and $\varphi_l(x)$ are the basis functions.
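One way to estimate a density ratio between the target window TW and the short window SW as a weighted sum of kernel basis functions, without estimating either density, is a least-squares fit in the style of uLSIF. The Python sketch below uses that technique as a stand-in: the text does not name the estimator it uses, and the Gaussian basis, the kernel width `sigma`, the regularization `lam`, and all function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, center, sigma):
    """Gaussian basis function centered on one target-window sample."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def fit_density_ratio(tw, sw, sigma=1.0, lam=0.1):
    """Return a function approximating P_TW(x) / P_SW(x) from the two sample windows."""
    centers = tw  # one basis function per target-window sample
    b = len(centers)
    phi_sw = [[gaussian_kernel(x, c, sigma) for c in centers] for x in sw]
    phi_tw = [[gaussian_kernel(x, c, sigma) for c in centers] for x in tw]
    # Regularized second moment of the basis over SW, and basis mean over TW.
    H = [[sum(row[i] * row[j] for row in phi_sw) / len(sw) + (lam if i == j else 0.0)
          for j in range(b)] for i in range(b)]
    h = [sum(row[i] for row in phi_tw) / len(tw) for i in range(b)]
    alpha = [max(0.0, a) for a in solve_linear_system(H, h)]  # keep the ratio non-negative

    def w_hat(x):
        return sum(a * gaussian_kernel(x, c, sigma) for a, c in zip(alpha, centers))

    return w_hat


sw = [[0.0], [0.2], [-0.1], [0.1]]   # labeled short window
tw = [[3.0], [3.1], [2.9], [3.2]]    # unlabeled target window, shifted away from SW
w_hat = fit_density_ratio(tw, sw)
print(w_hat([3.0]) > w_hat([0.0]))
```

Estimated ratios far from 1 on the target-window samples suggest that the content distribution has shifted, at which point model S would be re-learned; the decision threshold is left open here because the text does not specify one.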

Spam is generally filtered with machine learning methods, which must parse, preprocess, and vectorize every mail and therefore consume a large amount of time. Co-PRFC instead combines rules with statistical methods, which reduces the computational complexity and shortens the filtering time. For a mail to be predicted, the system first checks whether the sender is trusted; if so, the mail is placed in the inbox. Otherwise, it checks whether the mail subject contains a 're' or 'reply' field, in which case the mail is judged to be normal; if not, the subject and the mail body are vectorized in turn and the class is predicted (as shown in FIG. 2). The vectorized subjects and bodies of the junk box data stream and the inbox data stream serve as the input variables of Filter_junkbox and Filter_inbox respectively.
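The rule cascade described above can be sketched in Python as follows. The trusted-sender check and the 're' / 'reply' subject rule come from the text; everything else is assumed for illustration: the bag-of-words `vectorize` stand-in, the `classify` callable returning a (label, confidence) pair, and the choice to fall back to the body only when the subject prediction is below the confidence threshold `xi`.

```python
import re
from collections import Counter

NORMAL, SPAM = 0, 1
# Matches "re" or "reply" as a whole word, e.g. "Re: minutes", but not "Free".
REPLY_PATTERN = re.compile(r"\b(re|reply)\b", re.IGNORECASE)


def vectorize(text):
    """Toy bag-of-words vector; a stand-in for the real feature extraction."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def predict_label(sender, subject, body, trusted_senders, classify, xi=0.8):
    """Apply the cheap rules first and call the statistical filter only if needed."""
    if sender in trusted_senders:
        return NORMAL                      # rule 1: trusted sender goes to the inbox
    if REPLY_PATTERN.search(subject):
        return NORMAL                      # rule 2: replies are treated as normal mail
    label, confidence = classify(vectorize(subject))
    if confidence >= xi:
        return label                       # the subject alone was decisive
    label, _ = classify(vectorize(body))   # otherwise vectorize the body as well
    return label


def keyword_classifier(vector):
    """Hypothetical filter used only to exercise the cascade."""
    hits = vector["prize"] + vector["winner"]
    return (SPAM, 0.9) if hits else (NORMAL, 0.5)


print(predict_label("x@y.example", "Re: prize draw", "", set(), keyword_classifier))
```

Ordering the checks from cheapest to most expensive means that vectorization is skipped for every mail the two rules already resolve.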

[ BENEFICIAL EFFECTS ]

We implemented the proposed filtering system Co-PRFC in Python. The models L and S corresponding to LW and SW are implemented with an ensemble algorithm that uses an SVM as the base learner. To obtain the optimal parameters, 10000 samples were randomly selected from TREC 2006c, and class-imbalanced private inbox and junk box data streams were constructed in proportion. Experiments also show that multi-task learning and the use of the collective environment improve the filtering performance of the filter. With TREC 2006c, TREC 2007p, and SEWM2010 as experimental data, Co-PRFC was compared with existing filters, and the proposed filter was verified to have a significant filtering effect. The method generalizes to some extent: besides mail, it can be used to filter information such as short messages and microblog comments.

Drawings

FIG. 1: main flow of filtering junk mail

FIG. 2: Co-PRFC process for predicting mail labels

FIG. 3: Co-PRFC system framework

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will become apparent to those skilled in the art after reading the present invention and are intended to fall within the scope of the appended claims.

The framework of the invention is shown in FIG. 3. Users of the same collective network may voluntarily share spam in order to exchange information with each other. The number of users in the collective, M, is fixed (we set M = 150). For each shared spam mail, the system first checks whether it is already in the collective junk box. If so, its reporting rate (the proportion of users in the collective who have judged the mail to be spam) is updated; if not, its initial reporting rate is set to 1/M and the mail is placed into the collective junk box. The filtering system of a particular user accesses the collective junk box from time to time, retrieves the mails whose reporting rate exceeds 1/3, and imports the non-redundant ones into the private junk box data stream for detection. The Co-PRFC pseudo-code is as follows:

Input: samples with true labels; samples without true labels; the parsed test mail email; the initial position T₀ and the current position T₁ of LW; the acceptable error rate threshold ρ of model L; the confidence threshold ξ for predicted labels; the initialized filters Filter_inbox and Filter_garbage.

Output: the predicted label y of email.

Claims

1. A personalized mail re-filtering method in a collective environment, comprising the following steps:

First, the number of users in the same group is fixed. Whenever a spam mail is shared, it is first checked against the mails in the collective junk box: if it duplicates one of them, the reporting rate of that mail is updated; otherwise, the mail is added to the collective junk box.

Second, mails with a higher reporting rate are successively placed into the private junk box of a specific user, and Co-PRFC decides, according to the user's interests, whether each mail is spam. If so, the mail is placed in the junk box; otherwise, it is placed in the inbox.

Third, spam is generally filtered with machine learning methods, which must parse, preprocess, and vectorize the mails and therefore consume a large amount of time; Co-PRFC combines rules with statistical methods to filter spam, which reduces the computational complexity and shortens the filtering time. For a mail to be predicted, the system first checks whether the sender is trusted; if so, the mail is placed in the inbox. Otherwise, it checks whether the mail subject contains a 're' or 'reply' field, in which case the mail is judged to be normal; if not, the subject and the mail body are vectorized in turn and the class is predicted (as shown in FIG. 2).

Fourth, to address wrongly filtered mails, two filters (Filter_junkbox and Filter_inbox) filter the junk box data stream and the inbox data stream respectively. The vectorized subjects and bodies of the junk box data stream and the inbox data stream serve as the input variables of Filter_junkbox and Filter_inbox respectively. Separate filtering, however, gives rise to case (2) of generalized virtual drift; based on multi-task learning theory, the two filters therefore learn from each other by borrowing each other's feature descriptions while filtering separately, which alleviates the class imbalance problem.

Fifth, a specific user's interests also change over time, so the invention designs a multi-window learning framework with two windows of truly labeled samples, the long window LW and the short window SW, and one window without true labels, the target window TW. Whether the user's interests have changed is detected from the prediction accuracy of the sub-models L and S, which correspond to the long window and the short window respectively; if the interests have changed, model L is reset with S. LW holds all samples collected since the last model update, whereas SW holds a fixed number of the most recent samples. Hence, when the error rate of L is lower than that of S, the user's current interests are stable; otherwise, the user's interests have changed recently.

Sixth, the invention uses the kernel density ratio $w(x) = P_{TW}(x) / P_{SW}(x)$ to detect whether the distribution of mail content has changed. If it has, model S is re-learned to adapt to the new data distribution, which improves the accuracy of the filter; otherwise, S is left unchanged. To avoid computing the data distribution $P_{TW}(x)$ explicitly, the density ratio is estimated with kernel functions:

$\hat{w}(x) = \sum_{l=1}^{N_m} \alpha_l \varphi_l(x)$

where $N_m$ is the size of the windows TW and SW, $\alpha_l$ are the parameters of the model, and $\varphi_l(x)$ are the basis functions.

2. The sharing of spam in a collective environment according to the first and second steps of claim 1, characterized in that the diversity of spam available to a specific user is increased while user privacy is protected, which improves the accuracy with which the personalized filter predicts spam. In the first step, the reported probability of each mail is recorded, both to ensure that the mails in the collective junk box are not redundant and to determine whether a mail is generally acknowledged as spam. For a specific user, a generally acknowledged spam mail is not necessarily spam and may be a normal mail, so the personalized filter determines the class of the mails taken from the collective junk box for subsequent filter learning.

3. The use of two filters according to the fourth step of claim 1, characterized in that a new filtering method further alleviates the wrong filtering of the system filter and automatically corrects wrongly filtered mails. To address the class imbalance between the data streams of the two mailboxes (private inbox and private junk box), the two filters learn from each other and filter separately, based on multi-task learning theory.

4. The multi-window framework and the kernel density ratio according to the fifth and sixth steps of claim 1, characterized in that concept drift is well mitigated. The multi-window framework compares the filter accuracy on different windows to detect whether the user's interests have changed and, if so, adjusts the filter; the kernel density ratio determines whether the distribution of the current mail content has drifted and, if so, the filter is updated. Together, the two greatly ease the difficulty of mail filtering.