CN106100973A

CN106100973A - A personalized spam filtering method and filtering device based on node similarity

Info

Publication number: CN106100973A
Application number: CN201610408178.9A
Authority: CN
Inventors: 刘昕; 邹苹钧; 王奕文; 王丰; 辛兆君
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2016-06-07
Filing date: 2016-06-07
Publication date: 2016-11-09

Abstract

An embodiment of the invention provides a spam filtering method. In the personalized spam filtering method based on node similarity, a user obtains, through the social network of e-mail users to which the user belongs, the spam information held by the user's trusted friends; based on the interest similarity between users, individual knowledge is pooled into collective intelligence to filter spam. An embodiment of the invention also provides a personalized spam filtering device based on node similarity. The technical solution provided by the embodiments filters spam in a timely manner and improves filtering accuracy.

Description

A personalized spam filtering method and filtering device based on node similarity

Technical field

The present invention relates to a spam filtering method, and in particular to a personalized spam filtering method and filtering device based on node similarity.

Background art

E-mail is an indispensable means of communication in daily life, but the spam that accompanies it spreads like a plague, polluting and damaging the network environment. Against this background, anti-spam research has become an important topic in reliable network communication.

Conventional spam filtering techniques fall into three groups: filtering based on blacklists/whitelists, filtering based on matching rules, and filtering based on mail content. Blacklist/whitelist filtering is the most widely applied spam filtering technique. A blacklist can be a list of IP addresses, domain names, or e-mail addresses of mail servers or senders; any mail whose source appears in the blacklist is identified as spam. Similarly, any mail whose source appears in the whitelist is treated as legitimate mail. Although this technique is simple and easy to configure, it performs poorly in practice: both the false negative rate and the false positive rate are high, and updating and maintaining the lists in real time is difficult.

Among filtering methods based on matching rules, the most widely used is filtering based on the Bayesian algorithm. A filter using the Bayesian algorithm can reach a filtering accuracy above 90%, but this technique requires a centralized filter, cannot make full use of the information in the network, suffers in filtering efficiency, and carries the risk of a single point of failure. Nevertheless, because of the high accuracy of Bayesian filters, the method is still worth drawing on.

Content-based methods focus on the mail text and filter mail using text classification and information filtering. They can obtain spam features automatically and have relatively high accuracy, but they have some unavoidable limitations: (1) Text characteristics. The method relies heavily on the text content, so well-disguised spam is difficult to separate from normal mail by text filtering alone. (2) Filtering efficiency. The method performs a large amount of matching and computation on the content of every mail, which consumes considerable memory and CPU; moreover, when the same mail is received by many users, the computation and filtering are repeated for each of them, which greatly reduces filtering efficiency.

Analysis shows that the most prominent feature of spam is mass mailing: the sender delivers the same spam message to a large number of recipients. Conventional spam filtering, however, largely lacks any analysis of the similarity between users. Therefore, if all e-mail users in the network take part in spam filtering and share with one another what each of them knows about spam, the shortcomings of content-based filtering can be compensated to a large extent.

Summary of the invention

To solve the problems of the prior art, the invention provides a personalized spam filtering method based on node similarity. The method uses social trust values and interest similarity to calculate how credible a friend's spam report is to the user, pools individual knowledge into collective intelligence, and obtains fully and promptly the spam filtering information held by the user's trusted friends, thereby filtering spam.

The technical solution adopted by the invention is as follows:

A personalized spam filtering method based on node similarity, comprising the following steps:

A. According to the social relations and similarity of users, define the social trust degree and the interest similarity between users;

B. With social network users as nodes and the relations between a user and the user's contacts as edges, build the topology graph of the social network; build the trust value list of the user's friends; build the user's local interest list and calculate the interest similarity between users from interest keywords; build the user's local spam list from the spam information obtained;

C. When the user receives a mail, perform first-layer filtering with a spam filter based on the Bayesian algorithm; if the mail is judged to be spam, mark it as spam and store it in the local spam list in the form of a spam report;

D. According to the spam reports received, the user performs second-layer filtering by applying node similarity;

E. Send the spam report to the friend nodes that reach the trust value threshold and the interest similarity threshold.

In step A, the social trust between users means: in the system environment in which the users are located, the direct trust of user A in user B, derived from the history of user A's contact with user B. The interest similarity means: if two users share identical interests, interest similarity is considered to exist between them.

In step B, a friend is another node with which the user is in frequent direct contact. The interest similarity is calculated with the Jaccard coefficient:

JC = M11 / (M10 + M01 + M11), where M11 is the number of interest keywords shared by the two users' interest lists, and M10 and M01 are the numbers of interest keywords that appear only in the first user's list and only in the second user's list, respectively.
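The Jaccard coefficient above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `jaccard_similarity`, the sample keywords, and the choice to return 0 when both lists are empty are assumptions, not part of the patent text.

```python
def jaccard_similarity(interests_a, interests_b):
    """Compute JC = M11 / (M10 + M01 + M11) for two interest keyword lists."""
    a, b = set(interests_a), set(interests_b)
    m11 = len(a & b)  # keywords shared by both users
    m10 = len(a - b)  # keywords only in the first user's list
    m01 = len(b - a)  # keywords only in the second user's list
    total = m10 + m01 + m11
    # Assumption: two empty interest lists are given a similarity of 0.
    return m11 / total if total else 0.0


# Hypothetical interest lists for two users.
alice = ["finance", "travel", "sports", "music"]
bob = ["travel", "music", "cooking"]
print(jaccard_similarity(alice, bob))  # 2 shared out of 5 distinct keywords -> 0.4
```

The value always lies between 0 (no shared keywords) and 1 (identical lists), which makes it straightforward to compare against a fixed interest similarity threshold.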

In step B, the spam information obtained consists of the spam information the user has marked and the spam reports sent to this node by other contacts; it is stored as the MD5 hash of the mail content.
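As a minimal sketch of how a local spam list keyed by MD5 hashes might look, the following uses Python's standard `hashlib` module. The helper name `mail_fingerprint`, the use of a `set`, and hashing the content as UTF-8 bytes without any normalization are assumptions made for illustration.

```python
import hashlib


def mail_fingerprint(content: str) -> str:
    """Return the MD5 hex digest of the mail content."""
    # Assumption: the content is hashed as UTF-8 bytes with no normalization.
    return hashlib.md5(content.encode("utf-8")).hexdigest()


# Hypothetical local spam list holding only digests, not mail bodies.
local_spam_list = set()
local_spam_list.add(mail_fingerprint("Win a free prize now!"))

print(mail_fingerprint("Win a free prize now!") in local_spam_list)  # True
print(mail_fingerprint("Win a free prize now?") in local_spam_list)  # False
```

Storing only the digest keeps the shared information small and avoids exchanging mail bodies, but the match is exact: as the second lookup shows, changing a single character yields a different digest.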

In step C, after the user receives a mail, the mail is first matched against the spam information in the local spam list; if no matching entry is found, the Bayesian algorithm is applied, and the local spam list is updated once the check is complete.
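The order of checks in step C can be sketched as follows. The function `filter_incoming` and its return labels are hypothetical, and `keyword_stub` is only a stand-in for the first-layer filter: it is not a Bayesian classifier, and any callable that returns True for spam could be plugged in instead.

```python
import hashlib


def filter_incoming(content, local_spam, classifier):
    """Check the local spam list first, then fall back to the first-layer filter."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if digest in local_spam:
        return "spam:local-list"   # already known, no classification needed
    if classifier(content):
        local_spam.add(digest)     # update the local spam list
        return "spam:first-layer"
    return "inbox"


def keyword_stub(text):
    """Stand-in for the Bayesian filter (assumption, for illustration only)."""
    return "free prize" in text.lower()


local_spam = set()
print(filter_incoming("Claim your FREE PRIZE today", local_spam, keyword_stub))  # spam:first-layer
print(filter_incoming("Claim your FREE PRIZE today", local_spam, keyword_stub))  # spam:local-list
print(filter_incoming("Meeting moved to 3 pm", local_spam, keyword_stub))        # inbox
```

Checking the hash list before running the classifier means a mail that has already been identified, locally or through a friend's report, costs only one hash computation and one lookup.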

In step D, when the user receives a spam report from another user, the report is ignored if it already exists in the user's local spam list. Otherwise, if the mail exists in the user's inbox, the mail is moved to the trash folder and stored in the local spam list in the form of a spam report.
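A rough sketch of the step D handling is given below. The function name `handle_report`, the dictionary-based inbox and trash folder, and the shortened placeholder digests are assumptions. The text does not say what happens when the reported mail is neither known nor present in the inbox, so the sketch simply takes no action in that case.

```python
def handle_report(report_md5, local_spam, inbox, trash):
    """Process one incoming spam report identified by the MD5 of the mail content."""
    if report_md5 in local_spam:
        return "ignored"                      # report already known locally
    if report_md5 in inbox:
        trash[report_md5] = inbox.pop(report_md5)  # move the mail to the trash folder
        local_spam.add(report_md5)                 # record the report locally
        return "moved_to_trash"
    # Assumption: behaviour for an unmatched report is not specified in the text.
    return "no_match"


# Inbox keyed by content digest; "a1b2" and "c3d4" are shortened placeholders.
inbox = {"a1b2": "Cheap watches!!!", "c3d4": "Lunch tomorrow?"}
trash = {}
local_spam = set()

print(handle_report("a1b2", local_spam, inbox, trash))  # moved_to_trash
print(handle_report("a1b2", local_spam, inbox, trash))  # ignored
print(handle_report("ffff", local_spam, inbox, trash))  # no_match
```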

In step E, when the trust value between users reaches the trust value threshold and the interest similarity reaches the interest similarity threshold, a user's spam report is automatically pushed to the other users. A spam report contains the MD5 hash of the spam content, the interest similarity between the users, and a flag bit marking the mail as spam.
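The report fields and the recipient selection of step E might be modelled as shown here. The class name `SpamReport`, the field names, the sample trust and similarity values, and the thresholds of 0.5 are all hypothetical; "reaches the threshold" is read as greater than or equal to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpamReport:
    content_md5: str            # MD5 hash of the spam content
    interest_similarity: float  # interest similarity between sender and recipient
    is_spam: bool               # flag bit marking the mail as spam


def select_recipients(trust, similarity, trust_threshold, similarity_threshold):
    """Return the friends whose trust value and interest similarity both reach the thresholds."""
    return sorted(
        friend
        for friend, value in trust.items()
        if value >= trust_threshold
        and similarity.get(friend, 0.0) >= similarity_threshold
    )


# Hypothetical local trust value table and interest list of one user.
trust = {"bob": 0.8, "carol": 0.4, "dave": 0.9}
similarity = {"bob": 0.6, "carol": 0.7, "dave": 0.2}

recipients = select_recipients(trust, similarity, 0.5, 0.5)
print(recipients)  # ['bob']

reports = [SpamReport("d41d8cd98f00b204e9800998ecf8427e", similarity[f], True) for f in recipients]
print(reports[0].interest_similarity)  # 0.6
```

In this sample, carol fails the trust threshold and dave fails the similarity threshold, so only bob receives the report: both conditions must hold together.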

In another aspect, the invention provides a personalized spam filtering device based on node similarity, comprising the following modules:

Definition module: according to the social relations and similarity of users, defines the social trust degree and the interest similarity between users;

Building module: takes social network users and their friends as nodes; builds the user's friend trust value list; builds the user's local interest list and calculates the interest similarity between users from interest keywords; builds the user's local spam list from the spam information;

Bayesian filtering module: when the user receives a mail, performs first-layer filtering with a spam filter based on the Bayesian algorithm; if the mail is judged to be spam, marks it as spam and stores it in the local spam list in the form of a spam report;

Personalized filtering module based on node similarity: according to the spam reports the user receives, performs second-layer filtering based on node similarity;

Sending module: sends the spam report to the friend nodes that reach the trust value threshold and the interest similarity threshold.

The technical solution and device provided by the invention have the following beneficial effects:

By building a social network model of mail users, the invention takes both the social trust value and the interest similarity between users into account, applies an interest similarity algorithm to calculate the credibility of friends' spam reports, and promptly obtains the spam filtering information held by the user's trusted friends, which improves the accuracy of spam filtering.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only one embodiment of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a topology diagram of a simple mail network in the personalized spam filtering method based on node similarity of the invention;

Fig. 2 is a schematic diagram of spam filtering by the personalized spam filtering device based on node similarity provided by an embodiment of the invention;

Fig. 3 is a structural diagram of the personalized spam filtering device based on node similarity provided by an embodiment of the invention.

Detailed description of the invention

To make the objectives, technical solutions, and advantages of the invention clearer, the embodiments of the invention are described in further detail below with reference to the drawings.

Embodiment one

This embodiment is premised on setting the system parameters appropriately in advance, so as to improve spam filtering accuracy.

Interests differ between users, and so do the trust values between them. Therefore, in the personal information settings, the interest keywords used by the user's system are set in advance, and the user selects, according to his or her own interests, the keywords of interest or of no interest. The user sets an initial trust value for each friend according to how much the user trusts that friend, and stores it in the user's local trust value table. The interest similarity between the user and each friend is calculated with the Jaccard coefficient formula and stored in the user's local interest list. The system updates the user's trust value for a friend by comparing the interest similarity with the configured interest similarity threshold. Each node also keeps a local spam list that holds the computed MD5 values of spam, so that spam information can be stored and shared.
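Claim 3 gives the trust update as T = T_ij + b, applied when the interest similarity between friends is higher than the threshold. A minimal sketch of that rule follows; the function name `update_trust` and the sample values are assumptions, and the text does not state an upper bound on the trust value, so none is applied here.

```python
def update_trust(initial_trust, similarity, similarity_threshold, increment):
    """Apply T = T_ij + b when the interest similarity is above the threshold."""
    if similarity > similarity_threshold:
        return initial_trust + increment
    # Similarity at or below the threshold: the trust value is left unchanged.
    return initial_trust


# Hypothetical local trust value table: friend -> initial trust value.
trust_table = {"bob": 0.5, "carol": 0.5}
# Hypothetical interest similarities, e.g. from the Jaccard coefficient.
similarities = {"bob": 0.75, "carol": 0.25}

for friend in trust_table:
    trust_table[friend] = update_trust(
        trust_table[friend], similarities[friend],
        similarity_threshold=0.5, increment=0.125,
    )

print(trust_table)  # {'bob': 0.625, 'carol': 0.5}
```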

After the user receives a mail, the system first computes the MD5 hash of the mail content and matches it against the spam information stored in the local spam list. If there is a match, the mail is marked as spam and moved to the trash folder. Otherwise, the system hands the mail to the spam filter based on the Bayesian algorithm (the first-layer filter) to judge whether it is spam. If it is judged to be spam, the system moves the mail to the trash folder and stores its information in the local spam list. At the same time, the interest-based spam filter (the second-layer filter) pushes the spam report to the friends whose trust value and interest similarity with this user exceed the thresholds. If the user receives a spam report from a friend, the system judges whether the mail is spam according to the result of matching the received report against the mail.

After a friend node of the user receives the spam report, its local system first checks whether the report already exists in the local spam list. If it does, the report is ignored and not processed. Otherwise, if the mail corresponding to the report is in the user's inbox, the system moves the mail to the trash folder, stores the report in the local spam list, and then pushes the report onward.
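The node-to-node behaviour described above can be simulated on a toy network. The sketch below is illustrative only: the function `propagate`, the four-node network, and the placeholder digest are hypothetical, `eligible_friends` is assumed to already contain only friends that pass both thresholds, and a node that does not hold the mail is assumed neither to store nor to forward the report.

```python
from collections import deque


def propagate(origin, digest, eligible_friends, inboxes, local_spam):
    """Push one spam report through the network; return the nodes that trashed the mail."""
    local_spam[origin].add(digest)
    queue = deque(eligible_friends.get(origin, []))
    trashed = []
    while queue:
        node = queue.popleft()
        if digest in local_spam[node]:
            continue  # report already known: ignored, which also stops loops
        if digest in inboxes[node]:
            inboxes[node].discard(digest)   # move the mail out of the inbox
            local_spam[node].add(digest)    # store the report locally
            trashed.append(node)
            queue.extend(eligible_friends.get(node, []))  # next push
        # Assumption: a node without the mail neither stores nor forwards the report.
    return trashed


d = "a1b2"  # shortened placeholder for an MD5 digest
eligible_friends = {"alice": ["bob", "carol"], "bob": ["alice", "dave"],
                    "carol": [], "dave": ["bob"]}
inboxes = {"alice": set(), "bob": {d}, "carol": set(), "dave": {d}}
local_spam = {name: set() for name in inboxes}

print(propagate("alice", d, eligible_friends, inboxes, local_spam))  # ['bob', 'dave']
```

Although alice and bob, and bob and dave, list each other as friends, the push terminates: a report that is already in a node's local spam list is ignored, so no node processes the same report twice.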

Claims

1. A personalized spam filtering method based on node similarity, comprising the following steps:

B. With social network users as nodes and the relations between a user and the user's contacts as edges, build the topology graph of the social network; build the trust value list of the user's friends; build the user's local interest list and calculate the interest similarity between users from interest keywords; build the user's local spam list from the spam information obtained;

C. When the user receives a mail, perform first-layer filtering with a spam filter based on the Bayesian algorithm; if the mail is judged to be spam, mark it as spam and store it in the local spam list in the form of a spam report;

2. The spam filtering method according to claim 1, characterized in that, in step A, the social trust between users means: in the system environment in which the users are located, the trust of user A in user B, derived from the history of user A's direct contact with user B; the interest similarity means: if two users share identical interests, interest similarity is considered to exist between them, and the more interest keywords they share, the higher the similarity.

3. The spam filtering method according to claim 1, characterized in that, in step B, the trust value is calculated as T = T_ij + b; that is, on the basis of the configured initial trust value, if the interest similarity between friends is higher than the threshold, the trust value increases by the configured increment b. The interest similarity is calculated with the Jaccard coefficient: JC = M11 / (M10 + M01 + M11), where M11 is the number of interest keywords shared by the two users' interest lists, and M10 and M01 are the numbers of interest keywords that appear only in the first user's list and only in the second user's list, respectively.

In step B, the spam information obtained consists of the spam information the user has marked and the spam reports sent to this node by other contacts; it is stored as the MD5 hash of the mail content.

4. The spam filtering method according to claim 1, characterized in that, in step C, after the user receives a mail, the mail is first matched against the spam information in the local spam list; if no matching entry is found, the Bayesian algorithm is applied, and the local spam list is updated once the check is complete.

5. The spam filtering method according to claim 1, characterized in that, in step D, when the user receives a spam report from another user, the report is ignored if it already exists in the user's local spam list; otherwise, if the mail exists in the user's inbox, the mail is moved to the trash folder and stored in the local spam list in the form of a spam report.

6. The spam filtering method according to claim 1, characterized in that, in step E, when the trust value between users reaches the trust value threshold and the interest similarity reaches the interest similarity threshold, a user's spam report is automatically pushed to the other users; the spam report contains the MD5 hash of the spam content, the interest similarity between the users, and a flag bit marking the mail as spam.

7. A personalized spam filtering device based on node similarity, comprising the following modules:

Definition module: according to the social relations and similarity of users, defines the social trust degree and the interest similarity between users;

Building module: takes social network users and their friends as nodes; builds the user's friend trust value list; builds the user's local interest list and calculates the interest similarity between users from interest keywords; builds the user's local spam list from the spam information;

Bayesian filtering module: when the user receives a mail, performs first-layer filtering with a spam filter based on the Bayesian algorithm; if the mail is judged to be spam, marks it as spam and stores it in the local spam list in the form of a spam report;

Personalized filtering module based on node similarity: when the user receives a spam report from another user, performs second-layer filtering by applying node similarity.