CN101764765A - Spam mail filtering method based on user interest

Info

Publication number: CN101764765A
Application number: CN200910242936A
Authority: CN
Inventors: 谭营
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2009-12-21
Filing date: 2009-12-21
Publication date: 2010-06-30

Abstract

The invention discloses a spam mail filtering method based on user interest. The method comprises the following steps: after each user receives a mail, parsing that user's mail to obtain the mail title, mail text, recipient address and sender address; performing word segmentation on the mail title and mail text, generating a feature vector from the segmented title and text and a detector set, and training on each user's own training set to generate a classifier model for each user; when a new mail is received, classifying it with the classifier model corresponding to the user; and, when a change in user interest is detected, retraining that user's classifier model with the mail, where user interest is expressed through the user's own definition of spam. The invention is designed to improve the overall performance of spam detection, detect changes in spam effectively, retrain a user's classifier model when that user's interest changes, and adapt to changes in user needs or interests.

Description

Spam mail filtering method based on user interest

Technical field

The present invention relates to the field of network security, and in particular to a spam mail filtering method based on user interest.

Background technology

In view of the serious social problems caused by spam, anti-spam strategies have received unprecedented attention in recent years. Many researchers have focused on the automated detection and filtering of spam and have proposed many methods, such as blacklists and machine learning (including Bayes, Support Vector Machine, Neural Network, Boosting Trees, etc.).

To provide spam filtering services to users, mail service providers apply these spam filtering methods at the server level. However, the results are still not fully satisfactory. One of the main problems is that existing server-side spam detection deployments do not distinguish between the interests of different users, cannot keep independent operating parameters and configurations for each user, and still less can adapt to changes in user interest.

Existing server-side spam detection technology keeps one unified set of spam detection parameters for all users and provides a single shared model. This implementation cannot handle differences in user interest (different definitions of spam and normal mail) or changes in user interest.

On the one hand, existing mail server implementations cannot satisfy the differing needs of different users. In real life, users' interests are not the same. For example, given the same mail containing recruitment information, user A may regard it as normal mail because he is looking for a job, while user B may regard it as spam because he does not need this information. In this case, if unified parameter settings and a unified detection model are provided to all users on the server, the server will inevitably give wrong spam detection results to some users. If the server judges the mail to be spam, it gives user A a wrong result, and mail that is normal for user A is filtered out as spam; otherwise, it gives user B a wrong result and fails to filter this mail for user B.

On the other hand, existing mail server implementations cannot adapt to changes in user interest. Because the prior art applies unified detection parameter settings to all users, when the interest (the definition of spam) of some mail users changes, the server cannot adjust to those users' interests without negatively affecting other users (their interests have not changed, so adjusting the parameters would degrade detection performance for them).

Summary of the invention

To address the deficiencies of the prior art, the present invention provides a new spam mail filtering method based on user interest. The method keeps independent parameter settings for each user (each corresponding to its own independent classifier model), and thus produces mail classification results for each user according to that user's interest. The scheme can also detect changes in each user's interest and promptly adjust the corresponding classifier model. When only some users' interests change, the scheme adjusts only the models of those users, retraining them and reclassifying mail.

To achieve the above object, the invention provides a spam mail filtering method based on user interest, comprising the following steps:

S1: after each user receives a mail, parse each user's mail separately to obtain the title, text, recipient address and sender address of the mail, where the recipient address is used to select and determine the corresponding detector set and classifier model;

S2: perform word segmentation on the title and text of the mail; generate a feature vector from the segmented title and text and the detector set; train on each user's own training set to generate an independent classifier model for each user; when a new mail is received, classify it according to the classifier model of the corresponding user; and when a change in user interest is detected, retrain the classifier model of the corresponding user with the mail, where the user interest is expressed through the user's definition of spam.

The classifier model is a support vector unit (that is, a set of multiple SVMs).

When user interest changes, a sliding window composed of multiple SVMs is used to update the detector set and the feature vector.

The retraining in step S2 is specifically: using the sliding window composed of multiple SVMs together with the new mail, retrain the user's classifier model, and adjust the support vector unit and detector set corresponding to each user in turn.

The classification in step S2 is specifically: input the feature vector into the support vector unit of the user determined by the recipient address; the returned classification result determines the class of the mail.

The above technical solution has the following advantages: by setting different classifier parameters and keeping a different classifier model for each user, the overall performance of spam detection can be improved; by using a sliding window composed of multiple SVMs, changes in user interest can be detected effectively, and after such a change is detected, the classifier model of the corresponding user is retrained to adapt to the change in user needs.

Description of drawings

Fig. 1 is a flow chart of the method according to an embodiment of the invention.

Embodiment

The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.

As shown in Fig. 1, the spam mail filtering method based on user interest according to an embodiment of the invention comprises the following steps:

S1: after each user receives a mail, parse each user's mail separately to obtain the title, text, recipient address and sender address of the mail, where the recipient address is used to select and determine the corresponding detector set and classifier model;
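The parsing in step S1 can be sketched with Python's standard `email` package. This is only a minimal illustration, not the patent's implementation: the function name `parse_mail` and the returned dictionary keys are assumptions introduced here, and the sketch takes the first `text/plain` part as the mail text.

```python
from email import message_from_string
from email.utils import parseaddr


def parse_mail(raw: str) -> dict:
    """Split a raw message into the four fields named in step S1.

    Hypothetical helper: the name and the dict keys are illustrative only.
    Encoded (RFC 2047) headers are returned as-is, without decoding.
    """
    msg = message_from_string(raw)
    part = msg
    if msg.is_multipart():
        # Use the first text/plain part as the mail text, if there is one.
        part = next(
            (p for p in msg.walk() if p.get_content_type() == "text/plain"),
            None,
        )
    body = ""
    if part is not None:
        payload = part.get_payload(decode=True)  # bytes, or None
        if payload is not None:
            charset = part.get_content_charset() or "utf-8"
            body = payload.decode(charset, errors="replace")
    return {
        "title": msg.get("Subject", ""),
        "body": body,
        # The recipient address later selects the user's detector set and model.
        "recipient": parseaddr(msg.get("To", ""))[1].lower(),
        "sender": parseaddr(msg.get("From", ""))[1].lower(),
    }
```

Lower-casing the addresses is a design choice of this sketch so that the recipient address can serve as a stable per-user lookup key.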

S2: perform word segmentation on the title and text of the mail; generate a feature vector from the segmented title and text and the detector set; train on each user's own training set to generate an independent classifier model for each user; when a new mail is received, classify it according to the classifier model of the corresponding user; and when a change in user interest is detected, retrain the classifier model of the corresponding user with the mail, where the user interest is expressed through the user's definition of spam.
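The text does not say how the detector set is structured or how it is combined with the segmented words. The sketch below is one possible reading under stated assumptions: each detector is modeled as a set of words, and each feature is the fraction of the mail's tokens matched by one detector. The names `segment` and `feature_vector` are hypothetical, and the regular-expression tokenizer is only a stand-in for a real word segmenter, which Chinese-language mail would require.

```python
import re
from typing import List, Set


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: lower-case alphanumeric runs.

    Assumption: a real deployment would plug in a proper segmenter here.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def feature_vector(title: str, body: str, detectors: List[Set[str]]) -> List[float]:
    """One feature per detector: share of tokens that the detector matches.

    Assumption: a detector is a plain set of words; the source does not
    define its actual form.
    """
    tokens = segment(title) + segment(body)
    if not tokens:
        return [0.0] * len(detectors)
    return [sum(t in d for t in tokens) / len(tokens) for d in detectors]
```

Because each user has an own detector set, the same mail can map to different feature vectors for different users, which is what allows the per-user classifier models to disagree.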

The classifier model is a support vector unit.

When user interest changes, a sliding window composed of multiple SVMs is used to update the detector set and the feature vector.

The retraining in step S2 is specifically: using the sliding window composed of multiple SVMs together with the new mail, retrain the user's classifier model, and adjust the support vector unit and detector set corresponding to each user in turn.
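The source does not describe how the sliding window is organized or how a change in interest is detected. One plausible reading, sketched below, keeps the most recent SVMs for a user, each trained on one batch of that user's labeled mail, classifies by majority vote, and treats a high error rate on a new labeled batch as a change in interest. All names (`train_linear_svm`, `SlidingWindowSVM`, `drift_threshold`) and all parameter values are assumptions of this sketch, and the small linear SVM trained by subgradient descent on the hinge loss is a stand-in for whatever SVM implementation is actually used.

```python
from collections import deque
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias) of a linear SVM


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     epochs: int = 100, eta: float = 0.1,
                     lam: float = 0.01) -> Model:
    """Linear SVM via subgradient descent on the L2-regularized hinge loss.

    Labels are +1 (spam) and -1 (normal mail).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # regularization step
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def svm_predict(model: Model, x: Sequence[float]) -> int:
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


class SlidingWindowSVM:
    """A user's support vector unit: the most recent `window_size` SVMs."""

    def __init__(self, window_size: int = 3, drift_threshold: float = 0.5):
        self.window = deque(maxlen=window_size)  # oldest SVM drops out
        self.drift_threshold = drift_threshold   # assumed error-rate cutoff

    def predict(self, x: Sequence[float]) -> int:
        votes = sum(svm_predict(m, x) for m in self.window)
        return 1 if votes > 0 else -1  # ties and empty window: normal mail

    def update(self, X, y) -> bool:
        """Add an SVM trained on a new labeled batch.

        Returns True if the batch suggests the user's interest changed, in
        which case the older SVMs are discarded before retraining.
        """
        drifted = False
        if self.window:
            errors = sum(self.predict(xi) != yi for xi, yi in zip(X, y))
            drifted = errors / len(y) > self.drift_threshold
        if drifted:
            self.window.clear()
        self.window.append(train_linear_svm(X, y))
        return drifted
```

Clearing the window on a detected change is a deliberately blunt choice for this sketch; a gentler variant would simply let the older SVMs slide out as new batches arrive.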

The classification in step S2 is specifically: input the feature vector into the support vector unit of the user determined by the recipient address; the returned classification result determines the class of the mail.
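The routing described here, where the recipient address selects the model that classifies the mail, can be sketched as a lookup table of per-user classifiers. The class name `PerUserFilter`, the example addresses, and the fallback for users without a model are assumptions of this sketch; the two lambda classifiers are stubs standing in for trained support vector units.

```python
from typing import Callable, Dict, Sequence

# A classifier maps a feature vector to +1 (spam) or -1 (normal mail).
Classifier = Callable[[Sequence[float]], int]


class PerUserFilter:
    """Keeps one independent classifier per recipient address."""

    def __init__(self) -> None:
        self.models: Dict[str, Classifier] = {}

    def register(self, address: str, model: Classifier) -> None:
        self.models[address.lower()] = model

    def classify(self, recipient: str, features: Sequence[float]) -> int:
        model = self.models.get(recipient.lower())
        if model is None:
            # Assumption: mail for users without a trained model is delivered.
            return -1
        return model(features)


# Stub models: the feature is the share of recruitment-related words.
filt = PerUserFilter()
filt.register("a@example.com", lambda f: -1)                       # job seeker
filt.register("b@example.com", lambda f: 1 if f[0] > 0.5 else -1)  # not interested
```

With these stubs, the same recruitment mail (feature vector `[0.9]`) is normal mail for `a@example.com` and spam for `b@example.com`, mirroring the user A and user B example in the background section.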

The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make improvements and modifications without departing from the technical principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A spam mail filtering method based on user interest, characterized by comprising the following steps:

S2: perform word segmentation on the title and text of the mail; generate a feature vector from the segmented title and text and the detector set; train on each user's own training set to generate an independent classifier model for each user; when a new mail is received, classify it according to the classifier model of the corresponding user; and when a change in user interest is detected, retrain the classifier model of the corresponding user with the mail, where the user interest is expressed through the user's definition of spam.

2. The spam mail filtering method based on user interest according to claim 1, characterized in that the classifier model is a support vector unit.

3. The spam mail filtering method based on user interest according to claim 2, characterized in that, when user interest changes, a sliding window composed of multiple SVMs is used to update the detector set and the feature vector.

4. The spam mail filtering method based on user interest according to claim 2, characterized in that the retraining in step S2 is specifically: according to the sliding window composed of multiple SVMs, adjusting the support vector unit and detector set corresponding to each user in turn.

5. The spam mail filtering method based on user interest according to claim 2, characterized in that the classification in step S2 is specifically: inputting the feature vector into the support vector unit of the user determined by the recipient address, where the returned classification result determines the class of the mail.