CN106713108B - Mail classification method combining user relationships and Bayesian theory - Google Patents

Mail classification method combining user relationships and Bayesian theory

Publication number
CN106713108B
Authority
CN
China
Prior art keywords
mail
transferred
customer relationship
user
white list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510779256.1A
Other languages
Chinese (zh)
Other versions
CN106713108A (en)
Inventor
周可
王桦
刘庆
沈慧羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201510779256.1A
Publication of CN106713108A
Application granted
Publication of CN106713108B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21: Monitoring or handling of messages
    • H04L51/212: Monitoring or handling of messages using filtering or selective blocking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/10: Office automation; Time management
    • G06Q10/107: Computer-aided management of electronic mailing [e-mailing]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227: Filtering policies
    • H04L63/0245: Filtering by information in the payload
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441: Countermeasures against malicious traffic
    • H04L63/1458: Denial of Service

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Human Resources & Organizations (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a mail classification method combining user relationships and Bayesian theory. It builds a user relationship graph from the user relationships contained in mail and combines the graph with an improved naive Bayes method to classify e-mail automatically, improving the accuracy of the classification system and reducing the misclassification rate. The invention proposes a certainty factor to estimate the confidence of the naive Bayes classifier's results and combines the naive Bayes method with the user relationship graph: the graph is built from the user relationships contained in normal mail, and a user whitelist is generated according to the general habits users follow when handling mail. While new mail is being classified, the classification results are continuously fed back into the user relationship graph and the user whitelist is updated, so that the system automatically adjusts the graph and the whitelist as new mail changes and thereby reaches higher accuracy.

Description

Mail classification method combining user relationships and Bayesian theory
Technical field
The invention belongs to the field of data mining, and more particularly relates to a mail classification method combining user relationships and Bayesian theory.
Background
With the rapid development of the Internet, daily life has merged with the network environment, and more and more people use the Internet for work, shopping, consumption, entertainment and other activities; e-mail has become one of the important means of everyday communication. According to the 35th "Statistical Report on Internet Development in China" released by the China Internet Network Information Center (CNNIC) in February 2015, by December 2014 the number of Internet users in China had exceeded 649 million, of whom more than 251 million were e-mail users. Abroad, there were about 929 million business e-mail accounts in 2013, and the number is still growing. Problems have followed, however: large volumes of spam flood people's lives and work, and spam is rampant on the network. Available data show that spam accounted for only 36% of all mail sent over the Internet in 2002, rose to 80% by 2006, and exceeded 95% by 2010.
The high usage rate and huge user base of e-mail bring convenience to work and life, but they also give a platform to people with malicious intent. In most people's mailboxes more than half of the received mail is spam. Spam disturbs daily life, occupies mailbox space, and wastes the energy of the people who must deal with it; it also puts great pressure on mail servers and consumes considerable network resources. Spam consists mainly of promotional advertising, training publicity, system push notifications and the like; the rest includes reactionary, pornographic and gambling content. For normal users, all of it costs time and effort to handle. Some spam even carries viruses, which seriously threaten the security of the user's machine and private information.
Mail classification technology has developed for more than a decade, and many techniques are now in practical use. The main spam classification techniques in use at home and abroad are the following:
1) Keywords (words)
Keyword-based spam classification works by building a library of sensitive keywords (words). The library contains most of the sensitive vocabulary likely to appear in spam, such as "discount", "promotion", "panic buying" and "$"; a mail that contains words from the library is often spam. To improve accuracy further, many people use keyword scoring: each library word found in a mail adds 1 to the mail's score, and the mail is judged to be spam once the total exceeds a set threshold. This was a common early anti-spam technique because it is very simple to implement and fast to run. However, as Internet technology has developed, spam has proliferated and its forms keep changing; to maintain classification accuracy the keyword library must be updated frequently, and the maintenance must be done by specialists. In addition, the technique has proved in practice to have too high a misclassification rate and too many limitations.
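The keyword scoring scheme described above can be sketched as follows. This is only an illustration: the keyword library, the threshold value and the function names are hypothetical, and the sketch counts each distinct library word once, which is one possible reading of "adds 1 to the score".

```python
import re

# Hypothetical keyword library and threshold, chosen only for illustration.
KEYWORDS = {"discount", "promotion", "sale", "$"}
THRESHOLD = 2

def keyword_score(text, keywords=KEYWORDS):
    """Count how many distinct library keywords occur in the mail text."""
    tokens = set(re.findall(r"[a-z]+|\$", text.lower()))
    return len(tokens & set(keywords))

def is_spam_by_keywords(text, threshold=THRESHOLD):
    """Judge the mail to be spam once its score exceeds the threshold."""
    return keyword_score(text) > threshold
```

Because the decision depends entirely on a fixed word list, any spam that avoids the listed words scores zero, which is the maintenance burden the text describes.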
2) Blacklists, whitelists and greylists
Blacklist, whitelist and greylist techniques target the contact, IP address, DNS or domain name of a mail, and classify mail by building the corresponding contact lists, IP address lists, DNS lists or domain name lists. Taking IP-address-based lists as an example: the blacklist technique builds a blacklist containing the IP addresses of all known spammers; when a mail arrives the blacklist is checked, and if the IP is on the list the mail is judged to be spam. The whitelist technique is the opposite: if the IP is on the whitelist, the mail is not spam. With greylisting, the mail server records the header information the first time a user sends mail and requires the sender to retransmit within a time window set by the greylist before the mail is accepted. However, greylisting may require repeated transmissions, which costs additional network bandwidth and increases server load. Blacklists and whitelists, because their decision rules are so restrictive, easily cause misclassification, and a complete list is hard to build in practice; they are generally used only as an auxiliary means in a classification system.
3) Fingerprinting
Fingerprinting generates a fingerprint from the content of each mail. To classify a mail, its fingerprint is submitted to a global server, which maintains a fingerprint database and decides whether the mail is spam according to the number of times the same fingerprint has been reported as spam. However, the fingerprint database requires regular maintenance, and a spam message must be widely propagated and reported to the global server before it can be recognized with high accuracy.
4) KNN
The KNN (K-Nearest-Neighbor) method assigns a mail to the class to which the majority of its K most similar samples in feature space belong. Each classification compares the mail with every sample in the sample space, selects the K samples with the highest similarity, and uses majority voting to take the class of most of those K samples as the class of the mail to be classified. However, KNN is computationally expensive and suits only small sample sets; moreover, different sample sets differ in their sensitivity to K, so K must be chosen with great care.
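A minimal sketch of the KNN vote described above. The text does not say which similarity measure is used, so Jaccard similarity over word sets is assumed here purely for illustration; the function names and the default k are likewise hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Similarity between two mails represented as sets of words (an assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(mail_words, samples, k=3):
    """samples: list of (word_set, label). Return the majority label of the k most similar samples."""
    ranked = sorted(samples, key=lambda s: jaccard(mail_words, s[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Every call scans all samples, which illustrates why the text considers KNN suitable only for small sample sets.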
5) Bayes
The Bayes method computes the posterior probability of a mail from the prior probabilities of the mail samples, using statistical probability calculation to obtain the probability that the mail belongs to each class and selecting the class with the highest probability as the class of the mail. The method first trains a classifier on training samples and then uses it to classify other mail. However, the performance of a Bayes classifier depends heavily on training: the size and quality of the training set strongly affect final performance, and once the classifier is built it cannot be changed, so it adapts poorly to the dynamic changes of mail.
6) SVM
The support vector machine (Support Vector Machine) method uses training samples to construct an optimal linear separating surface, which guarantees the maximum margin between classes. The method is well suited to small sample sets and to high-dimensional pattern recognition, where it works well.
7) Community-based method
The community-based method clusters mail into several communities, i.e. classes, by similarity. To determine the community of a new mail, its similarity to each community is computed and the community with the highest similarity is selected. In addition, the method checks whether adding the mail to that community increases the community's center similarity; if so, the new mail is added to the community. The method first assumes that mail can be divided into communities by content, and it must then compute the similarity between the mail and all samples in each community, so it suits only small-scale mail classification.
8) Method based on social relationships and URLs
The method based on social relationships and URLs (UNIK) is designed specifically for mail that contains URLs. It builds a relationship graph from the two parties of each mail and the URLs they have in common, identifies all normal users in the graph, and prunes the graph step by step; the nodes that remain are spammers. The method requires that spam be propagated more widely than normal mail, i.e. sent to many recipients, and it can classify only mail that contains URLs, so it has certain limitations.
Besides the methods listed above, there are other mail classification methods, such as decision-tree-based methods, Boosting methods, behavior-pattern-based methods, protocol-based methods and image-recognition-based methods.
However, these existing mail classification methods have the following problems:
1. They are inefficient at handling anti-classification techniques such as changing the format and content of spam;
2. They ignore the user relationships contained in mail;
3. Their misclassification rate for normal mail, i.e. the probability that normal mail is judged to be spam, is relatively high.
Summary of the invention
In view of the above defects of, or needs for improvement in, the prior art, the present invention provides a mail classification method combining user relationships and Bayesian theory. Its purpose is to solve the technical problems of low classification accuracy and a relatively high misclassification rate for normal mail, which arise because existing classification methods do not consider user relationships. The user relationships contained in mail are combined with the naive Bayes method to improve classification accuracy and reduce the misclassification rate.
To achieve the above object, according to one aspect of the present invention, a mail classification method combining user relationships and Bayesian theory is provided, comprising the following steps:
(1) obtaining mail samples and training a naive Bayes classifier on them, the training samples being divided into normal mail and spam, and building a user relationship graph graphMap from the contact relationships in the normal mail of the training samples;
(2) extracting a user whitelist from the user relationship graph graphMap built in step (1), the user whitelist being initially empty;
(3) judging new mail according to the user relationship graph built in step (1), the trained naive Bayes classifier and the user whitelist extracted in step (2), and updating the user relationship graph and the user whitelist according to the judgment result.
Preferably, building the user relationship graph graphMap comprises the following sub-steps:
(1-1) reading one mail from the mail samples and judging whether it is normal mail; if so, going to step (1-2), otherwise going to step (1-9);
(1-2) judging whether the sender of the mail exists in the mapping idMap; if not, going to step (1-3), otherwise going to step (1-4);
(1-3) adding the sender's account to idMap and updating the id counter, then going to step (1-4);
(1-4) judging whether the recipient of the mail exists in idMap; if not, going to step (1-5), otherwise going to step (1-6);
(1-5) adding the recipient's account to idMap, updating the id counter, adding the recipient's identity to the recipient list set Set, and going to step (1-6);
(1-6) judging whether the sender's identity is in the user relationship graph graphMap; if not, going to step (1-7), otherwise going to step (1-8);
(1-7) adding the sender id and the recipient list set Set to the user relationship graph graphMap as a key-value pair, then going to step (1-9);
(1-8) updating the value corresponding to the sender with the recipient list set Set, and going to step (1-9);
(1-9) judging whether any other mail remains in the mail samples; if so, returning to step (1-1), otherwise ending the process.
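Sub-steps (1-1) to (1-9) can be sketched roughly as below. The names idMap and graphMap come from the text; the input format (tuples of sender, recipient list and a normal-mail flag) and the function name are assumptions. As a simplification, the sketch records every recipient of a normal mail under its sender, whereas the text adds a recipient to Set only in step (1-5).

```python
def build_graph(samples):
    """samples: iterable of (sender, recipients, is_normal).
    Returns (id_map, graph_map); graph_map maps a sender id to a set of recipient ids."""
    id_map = {}      # account -> integer id (idMap in the text)
    graph_map = {}   # sender id -> set of recipient ids (graphMap in the text)

    def get_id(account):
        # Steps (1-2)/(1-3) and (1-4)/(1-5): give unseen accounts the next counter value.
        if account not in id_map:
            id_map[account] = len(id_map)
        return id_map[account]

    for sender, recipients, is_normal in samples:
        if not is_normal:  # step (1-1): only normal mail contributes to the graph
            continue
        sender_id = get_id(sender)
        recipient_ids = {get_id(r) for r in recipients}
        # Steps (1-6) to (1-8): create the sender's entry or merge into the existing one.
        graph_map.setdefault(sender_id, set()).update(recipient_ids)
    return id_map, graph_map
```

Storing integer ids rather than account strings keeps the graph compact, which appears to be the purpose of idMap and the id counter.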
Preferably, step (2) comprises the following sub-steps:
(2-1) reading one key-value pair from the user relationship graph graphMap and adding the recipient identities in its recipient list set Set to the user whitelist;
(2-2) judging whether any unread key-value pair remains in the user relationship graph graphMap; if so, going to step (2-1), otherwise ending the process.
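Sub-steps (2-1) and (2-2) amount to taking the union of all recipient sets in the graph. A minimal sketch, assuming graphMap is held as a dictionary from sender id to a set of recipient ids (the function name is hypothetical):

```python
def extract_whitelist(graph_map):
    """Collect every recipient id that appears in the user relationship graph."""
    whitelist = set()  # the user whitelist starts empty, as stated in step (2)
    for _sender_id, recipient_ids in graph_map.items():  # (2-1), repeated per (2-2)
        whitelist.update(recipient_ids)
    return whitelist
```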
Preferably, step (3) comprises the following sub-steps:
(3-1) reading a new mail and judging whether its sender is in the user whitelist; if so, the mail is normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) adding the user relationship between the sender and the recipients to the user relationship graph graphMap, adding the recipients to the user whitelist, and going to step (3-9);
(3-3) classifying the mail with the naive Bayes classifier and judging whether the classification result is spam; if so, going to step (3-4), otherwise going to step (3-6);
(3-4) using the certainty factor δ to judge whether the mail is high-confidence spam; if so, going to step (3-5), otherwise going to step (3-9);
(3-5) deleting the sender of the mail from the user relationship graph graphMap, then going to step (3-9);
(3-6) using the naive Bayes classifier and the certainty factor to judge whether the mail is high-confidence normal mail; if so, going to step (3-7), otherwise going to step (3-8);
(3-7) adding the recipient's identity ID to the user whitelist, then going to step (3-8);
(3-8) adding the user relationship between the sender and the recipients of the mail to the user relationship graph graphMap;
(3-9) judging whether any new mail remains to be classified; if so, returning to step (3-1), otherwise ending.
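The feedback loop of sub-steps (3-1) to (3-9) can be sketched as follows. Several things here are assumptions made for illustration: accounts are used directly as graph keys instead of ids, the classifier is passed in as a hypothetical callable cost_of that returns the ratio P(Spam|x)/P(Normal|x), and the default values of λ and δ are arbitrary. Step (3-5) is modeled as removing only the sender's own entry from graphMap.

```python
def classify_stream(mails, cost_of, graph_map, whitelist, lam=1.0, delta=10.0):
    """mails: iterable of (sender, recipients). Mutates graph_map and whitelist in place
    and returns the list of labels assigned to the mails."""
    labels = []
    for mail in mails:
        sender, recipients = mail
        if sender in whitelist:                                     # (3-1)
            graph_map.setdefault(sender, set()).update(recipients)  # (3-2)
            whitelist.update(recipients)
            labels.append("normal")
            continue
        cost = cost_of(mail)                                        # (3-3)
        if cost > lam:
            labels.append("spam")
            if cost > delta:                                        # (3-4) high-confidence spam
                graph_map.pop(sender, None)                         # (3-5)
        else:
            labels.append("normal")
            if cost < 1.0 / delta:                                  # (3-6) high-confidence normal
                whitelist.update(recipients)                        # (3-7)
            graph_map.setdefault(sender, set()).update(recipients)  # (3-8)
    return labels
```

Whitelisted senders never reach the classifier, which is how the method both shields trusted senders from misclassification and reduces the number of naive Bayes evaluations.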
Preferably, step (3-3) is specifically: if the ratio cost of the probability that the mail is spam to the probability that it is normal mail is greater than the cost-sensitive factor λ, the mail is judged to be spam; otherwise it is judged to be normal mail.
Preferably, step (3-3) is specifically: if cost is greater than the parameter δ, the classification result has high confidence and the mail is judged to be high-confidence spam, where δ is a parameter much larger than the cost-sensitive factor λ.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:
1. The invention solves the problem that existing methods handle anti-classification techniques, such as changing the format and content of spam, inefficiently: because the invention uses the naive Bayes classifier of step (1), and the classifier learns new anti-classification techniques through training, such techniques can be handled efficiently;
2. The invention solves the problem that existing methods ignore the user relationships contained in mail: because the invention uses the user relationship graph construction of step (1) and combines the user relationships contained in mail with the naive Bayes classifier, these relationships are no longer neglected;
3. The invention solves the problem that existing methods misclassify normal mail relatively often: because the invention uses the user whitelist extraction of step (2), high-confidence normal mail is picked out in advance and is not examined by the naive Bayes classifier, which lowers the probability that normal mail is misjudged as spam;
4. The invention filters out high-confidence normal mail through the whitelist, which reduces the number of mails the naive Bayes classifier has to judge and improves the efficiency of mail classification.
Brief description of the drawings
Fig. 1 is a flowchart of mail classification.
Fig. 2 is a flowchart of building the user relationship graph.
Fig. 3 is a flowchart of building the initial user whitelist.
Fig. 4 is a flowchart of the mail classification method of the present invention combining user relationships and Bayesian theory.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
Naive Bayes: for a given sample instance $x = \{x_1, x_2, \ldots, x_m\}$, deciding which class it belongs to requires computing the posterior probability of each class and selecting the class with the highest probability as the class of the instance. The present invention classifies mail into two classes, normal mail (Normal) and spam (Spam), so the naive Bayes method only needs to compare $P(\mathrm{Spam} \mid x)$ with $P(\mathrm{Normal} \mid x)$ and select the class with the larger probability as the final class of the mail to be classified. After the cost-sensitive factor (denoted λ) is introduced, the system judges a mail to be spam only when $P(\mathrm{Spam} \mid x)$ and $P(\mathrm{Normal} \mid x)$ satisfy the following relationship:
$$\frac{P(\mathrm{Spam} \mid x)}{P(\mathrm{Normal} \mid x)} > \lambda$$
Otherwise the mail is treated as normal mail.
On the basis of taking misclassification cost into account, the present invention proposes a certainty factor (denoted δ) to estimate the confidence of the two-class classification result without changing the result given by the naive Bayes classifier. Normally the certainty factor δ is much larger than the cost-sensitive factor λ. Once the certainty factor is introduced, the confidence of the naive Bayes classifier's results can be measured: high-confidence normal mail is selected to build and update the user relationship graph and the whitelist, and high-confidence spam is excluded from them, so that the users in the whitelist are confirmed normal users.
Specifically, when the naive Bayes classifier has judged a mail to be spam, if the following relationship holds:
$$\frac{P(\mathrm{Spam} \mid x)}{P(\mathrm{Normal} \mid x)} > \delta$$
then the classification result is said to have high confidence, and the mail is called confirmed spam. Likewise, when the naive Bayes classifier has judged a mail to be normal, if the following relationship holds:
$$\frac{P(\mathrm{Spam} \mid x)}{P(\mathrm{Normal} \mid x)} < \frac{1}{\delta}$$
then the mail is called confirmed normal mail.
With the certainty factor δ set, let cost denote the ratio $P(\mathrm{Spam} \mid x) / P(\mathrm{Normal} \mid x)$. When cost > λ, the mail is judged to be spam; if in addition cost > δ, the result has high confidence and the mail is called confirmed spam. When cost ≤ λ, the mail is treated as normal mail; if it also satisfies cost < 1/δ, the mail is confirmed normal mail. When cost lies between λ and δ (or between 1/δ and λ), the mail is judged only to be ordinary spam (or ordinary normal mail), and the result is not treated as a high-confidence classification.
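The four outcomes described above can be written as a small decision function. This is a sketch under assumptions: the function name and the default values λ = 1 and δ = 10 are illustrative only, and the posterior for normal mail is assumed to be strictly positive.

```python
def judge(p_spam, p_normal, lam=1.0, delta=10.0):
    """Return (label, is_confident) from the two posteriors P(Spam|x) and P(Normal|x)."""
    cost = p_spam / p_normal  # assumes p_normal > 0
    if cost > lam:
        return "spam", cost > delta          # confirmed spam only when cost > delta
    return "normal", cost < 1.0 / delta      # confirmed normal only when cost < 1/delta
```

Note that δ never changes the label; it only marks which results are trusted enough to update the user relationship graph and the whitelist.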
After the certainty factor δ is introduced and set to a relatively large value, the classification results with high confidence can be identified, which plays a very important role in the classification strategy of the present invention that combines user relationships with Bayesian theory.
Every e-mail contains information such as the sender, recipients, subject, body and time. There is only one sender, but there may be several recipients, and the user relationships between mail contacts are one of the important features of mail. Mail comes from people's real lives, and to some extent it reflects their behavioral habits and preferences. In real life, when people receive irrelevant spam they may mark it as spam and move it to the trash, delete it directly, or simply leave it in the mailbox and ignore it; the vast majority of people, however, will not reply to spam. This indicates that if a mail is determined to be normal mail, its recipients are generally normal users. Furthermore, if two contacts exchange normal mail with each other, both are generally ordinary users. Considering these behavioral habits, the present invention stipulates that: first, if two contacts have a normal-mail contact relationship, both contacts are normal users; second, if a mail is confirmed normal mail, the recipients of the mail are normal users. Whether a mail is confirmed normal mail is judged with the certainty factor proposed by the invention.
Unlike existing methods of building a user relationship graph, the present invention does not use all mail to build a complete graph, but uses only the normal mail. From the contact information of all normal mail it builds a user relationship graph of normal mail and then extracts from it the normal contacts with high confidence into the user whitelist (Whitelist), so that in the end every user in the whitelist is a highly trusted normal user.
The classification process of the proposed mail classification method combining user relationships and Bayesian theory is shown in Fig. 1. Before a new mail is classified, the Bayes classifier must first be trained, an initial user relationship graph is built from the training samples, and the user whitelist is extracted from it. After training, to classify a newly input mail, the user whitelist is first searched for the sender of the mail, and all mail sent by users in the whitelist is classified as normal mail. If the sender is not found in the user whitelist, the mail is handed to the Bayes classifier, and the classification result is fed back into the user relationship graph while the user whitelist is updated.
The present invention describes a mail classification method combining user relationships and Bayesian theory, which builds a user relationship graph from the user relationships contained in mail and combines it with the naive Bayes method, improving the accuracy of mail classification and reducing the misclassification rate.
Bayes classification requires a training set that contains both classes of mail, represented in a machine-readable numeric form. The present invention represents the content of a mail with the vector space model (Vector Space Model, VSM). Each feature item corresponds to a word in the vocabulary, and to simplify the mail model the weight of each feature item is only 0 or 1: a weight of 0 means the feature does not occur in the mail, and a weight of 1 means it does. The final representation of a mail's content is therefore a multidimensional vector whose components take only the values 0 and 1.
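As a small illustration of this 0/1 representation, the sketch below maps a segmented mail onto a fixed vocabulary. The function name and the example vocabulary are hypothetical.

```python
def to_binary_vector(words, word_table):
    """Return a 0/1 vector with one component per vocabulary word, in vocabulary order."""
    present = set(words)
    return [1 if w in present else 0 for w in word_table]

# A mail containing "greeting" and "card" (twice) over a three-word vocabulary.
vector = to_binary_vector(["greeting", "card", "card"], ["card", "discount", "greeting"])
```

Repeated words do not change the vector, since the model records only presence or absence.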
Mail is converted to the vector of VSM expression, it is also necessary to word segmentation processing will be carried out to mail.Segmenting method has at present Many kinds common are method, forward direction maximum matching method, reverse maximum matching method, two-way maximum matching method, word based on dictionary Frequency statistic law, minimum participle method and the segmenting method based on statistics etc., the present invention using NLPIR participle tool to mail header and Text carries out word segmentation processing.NLPIR is the Chinese word segmenting that doctor Zhang Huaping team, Inst. of Computing Techn. Academia Sinica establishes System, it also known as ICTCLAS are the Words partition systems established based on multilayer Hidden Markov Model, support Chinese word segmentation, English point Word, Chinese and English number mixing participle etc., and support part-of-speech tagging.Entitled " " the New Year's Day he that Zhang San mails equipped with an envelope mail Card " ", segmented using NLPIR participle tool, be set using level-one part-of-speech tagging collection, word segmentation result be "/tri-/m of q posts/v Come/v /u "/w New Year's Day/t greeting card/n "/w ".Provided with can be easily by some pairs of content of text after part-of-speech tagging collection It influences little word to filter out, the present invention only retains verb (/v), noun (/n) and adjective (/a).
After the e-mails in the network audit system have been processed by the segmentation tool, the title and body of each mail become a sequence of individual words, and these words constitute the features of the mail. To classify mail, a single unified feature dictionary is needed to represent all mails; this dictionary is called the vocabulary (WordTable). If every word occurring in any mail were added to the vocabulary, the result would usually be very large: depending on the size of the training set, it may contain hundreds of thousands of words. An oversized vocabulary makes mails harder to represent and also increases the computational cost of the system. To reduce the interference of words unrelated to mail features, the vocabulary must be pruned. Pruning by part-of-speech tag shrinks the vocabulary considerably, but the vocabulary formed by the remaining word classes is still large, so a further step is needed, namely feature selection (Feature Selection).
The present invention performs feature selection with the χ² statistic. The χ² statistic measures the degree to which a feature and a class lack independence: when feature X_i and class C_k are strongly correlated (weakly independent), the χ² value is large; when they are largely independent (weakly correlated), the χ² value is small. In practice, the χ² value of every feature is computed and the specified number of features with the largest values is selected.
The present invention has two mail classes: spam (Spam) and normal mail (Normal). Computing the χ² value of feature X_i requires counting the occurrences of X_i in each class, as shown in Table 1:
Table 1: Occurrence counts of feature X_i in the spam and normal mail classes
In Table 1, A, B, C, and D denote the occurrence counts of feature X_i in the respective classes. Letting S denote the total number of mails (S = A + B + C + D), the χ² value of feature X_i is computed as shown in Expression 1:
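Expression 1 does not survive in this text. As a hedged reconstruction, the standard χ² statistic for a 2×2 contingency table is given here, under the assumption (not confirmed by the source, since the body of Table 1 is also missing) that A and B count the spam and normal mails containing X_i, and C and D count the spam and normal mails that do not contain it:

```latex
\[
\chi^2(X_i) \;=\; \frac{S\,(AD - BC)^2}{(A+B)\,(C+D)\,(A+C)\,(B+D)},
\qquad S = A + B + C + D .
\]
```

The value is 0 when AD = BC, that is, when the feature is distributed across the two classes exactly as independence would predict, and it grows as the feature concentrates in one class.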
The χ² values of all features are computed, the k features with the largest values are selected, and these k features are then used to quantize each mail. For any mail, the occurrence of each of the k features is checked in turn: if a feature occurs, the element at the corresponding position of the mail's vector is 1; otherwise it is 0.
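A minimal sketch of χ²-based feature selection follows. It assumes the standard 2×2 form of the statistic and a particular cell layout (stated in the docstring), neither of which is confirmed by the surviving text; the function names and sample data are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table (assumed layout): a/b = spam/normal mails
    containing the feature, c/d = spam/normal mails that lack it."""
    s = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else s * (a * d - b * c) ** 2 / denom


def select_top_k(mails, labels, k):
    """mails: list of word sets; labels: parallel list of 'spam' / 'normal'.
    Returns the k words with the largest chi-square value (ties broken by word)."""
    n_spam = sum(1 for y in labels if y == "spam")
    n_normal = len(labels) - n_spam
    scores = {}
    for word in set().union(*mails):
        a = sum(1 for m, y in zip(mails, labels) if y == "spam" and word in m)
        b = sum(1 for m, y in zip(mails, labels) if y == "normal" and word in m)
        scores[word] = chi_square(a, b, n_spam - a, n_normal - b)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical toy corpus.
mails = [{"win", "prize"}, {"win", "cash"}, {"meeting", "notes"}, {"meeting", "cash"}]
labels = ["spam", "spam", "normal", "normal"]
top = select_top_k(mails, labels, 2)
```

In this toy corpus "win" occurs only in spam and "meeting" only in normal mail, so both score highest, while "cash" occurs equally in both classes and scores 0.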
In the customer relationship graph, each node represents a contact account, and a directed edge between two nodes represents mail sent from one contact to the other. To reduce the storage space of the graph, all contact accounts are passed through a hash mapping (idMap). Each key in the mapping is a user account and the corresponding value is the serial number assigned to that account, so each key-value pair in idMap associates one user account with one serial number.
The system maintains a contact account counter, initialized to 0 by default; whenever a new contact account appears, the counter is incremented by 1.
The present invention builds the customer relationship graph from normal mail. Nodes are connected by directed edges, and the graph is stored in a mapping structure called graphMap. Key-value pairs in graphMap have the form <Integer, Set<Integer>>: each integer key is a sender's serial number, that is, the value to which idMap maps some user account, and every serial number in the Set is a recipient of that sender. A graphMap of this form fully represents the customer relationship graph, and the hash mapping structure speeds up node lookups in the graph.
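The two mappings can be sketched with ordinary hash-based containers. This is an illustration only: the e-mail addresses are hypothetical, and using the incremented counter value as the serial number is an assumption, since the text does not say exactly how the counter value becomes the serial number.

```python
id_map = {}      # account -> serial number (idMap in the text)
graph_map = {}   # sender serial number -> set of recipient serial numbers (graphMap)


def get_id(account):
    """Return the account's serial number, assigning the next counter value if new."""
    if account not in id_map:
        # Assumption: the counter equals the number of accounts seen so far.
        id_map[account] = len(id_map) + 1
    return id_map[account]


def add_edge(sender, recipient):
    """Record a directed edge sender -> recipient."""
    graph_map.setdefault(get_id(sender), set()).add(get_id(recipient))


add_edge("zhangsan@example.com", "lisi@example.com")
add_edge("zhangsan@example.com", "wangwu@example.com")
```

Storing integers rather than account strings in every recipient set is what saves space: each account string is stored once, in `id_map`.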
As shown in Fig. 4, the mail classification method of the present invention, which combines customer relationships with Bayesian theory, comprises the following steps:
(1) Obtain mail samples and train a naive Bayes classifier on them, the training samples being divided into normal mail and spam, and build the customer relationship graph graphMap from the contact relationships in the normal mail of the training samples;
For example, if user Zhang San has sent mail to both Li Si and Wang Wu, and the identifiers of Zhang San, Li Si, and Wang Wu are 10, 20, and 30 respectively, then the mapping idMap is <Zhang San, 10>, <Li Si, 20>, <Wang Wu, 30>, and the customer relationship graph graphMap is <10, <20, 30>>;
The customer relationship graph graphMap is built as shown in Fig. 2, through the following sub-steps:
(1-1) Read one mail from the mail samples and determine whether it is normal mail; if so, go to step (1-2), otherwise go to step (1-9);
(1-2) Determine whether the sender of the mail exists in the mapping idMap; if not, go to step (1-3), otherwise go to step (1-4);
(1-3) Add the sender's account to idMap and update the id counter, then go to step (1-4); specifically, updating the id counter means incrementing its value by one;
(1-4) Determine whether the recipient of the mail exists in idMap; if not, go to step (1-5), otherwise go to step (1-6);
(1-5) Add the recipient's account to idMap, update the id counter, add the recipient's identifier to the recipient list set Set, and go to step (1-6);
(1-6) Determine whether the sender's identifier is in the customer relationship graph graphMap; if not, go to step (1-7), otherwise go to step (1-8);
(1-7) Add the sender id and the recipient list set Set to graphMap as a key-value pair, then go to step (1-9);
(1-8) Update the value corresponding to the sender with the recipient list set Set, and go to step (1-9);
(1-9) Determine whether there are other mails in the mail samples; if so, return to step (1-1), otherwise the process ends.
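Sub-steps (1-1) to (1-9) can be sketched as a single loop. This is a simplified illustration under stated assumptions: samples arrive as hypothetical `(sender, recipients, label)` tuples, and every recipient of a normal mail is added to the sender's set, whereas the sub-steps as written only mention adding a recipient in step (1-5).

```python
def build_graph(samples):
    """samples: iterable of (sender, recipients, label); label is 'normal' or 'spam'.
    Returns (id_map, graph_map)."""
    id_map, graph_map, counter = {}, {}, 0
    for sender, recipients, label in samples:
        if label != "normal":                   # (1-1): only normal mail adds edges
            continue
        ids = []
        for account in [sender, *recipients]:   # (1-2)..(1-5): assign serial numbers
            if account not in id_map:
                counter += 1
                id_map[account] = counter
            ids.append(id_map[account])
        # (1-6)..(1-8): create the sender's recipient set or extend the existing one
        graph_map.setdefault(ids[0], set()).update(ids[1:])
    return id_map, graph_map


# Hypothetical training samples.
samples = [
    ("a", ["b", "c"], "normal"),
    ("s", ["a"], "spam"),
    ("a", ["d"], "normal"),
]
id_map, graph_map = build_graph(samples)
```

The spam sender "s" never enters either mapping, so the graph reflects only relationships evidenced by normal mail.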
(2) Extract the user white list from the customer relationship graph graphMap built in step (1), the white list being initially empty; as shown in Fig. 3, this step comprises the following sub-steps:
(2-1) Read one key-value pair from graphMap and add the recipient identifiers in its recipient list set Set to the user white list;
(2-2) Determine whether unread key-value pairs remain in graphMap; if so, go to step (2-1), otherwise the process ends.
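Sub-steps (2-1) and (2-2) amount to taking the union of all recipient sets in graphMap. A minimal sketch, with a hypothetical function name:

```python
def extract_whitelist(graph_map):
    """Collect every recipient serial number that appears in graph_map."""
    whitelist = set()                        # the white list starts empty
    for recipients in graph_map.values():    # (2-1), repeated via (2-2)
        whitelist |= recipients
    return whitelist


whitelist = extract_whitelist({10: {20, 30}, 20: {40}})
```

Note that, as described, a sender is white-listed only if it also appears as someone's recipient; in this example 10 is not in the result.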
(3) Classify new mail using the customer relationship graph and the trained naive Bayes classifier from step (1) together with the user white list extracted in step (2), and update the customer relationship graph and the user white list according to the result; this step comprises the following sub-steps:
(3-1) Read a new mail and determine whether its sender is in the user white list; if so, the mail is normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) Add the relationship between the sender and the recipient to graphMap, add the recipient to the user white list, and go to step (3-9);
(3-3) Classify the mail with the naive Bayes classifier and determine whether the result is spam; if so, go to step (3-4), otherwise go to step (3-6). Specifically, a cost-sensitive factor λ is set to separate spam from normal mail. The quantity on which the classifier's decision is based is denoted cost, namely the ratio of the probability that the mail is spam to the probability that it is normal mail. If cost > λ, the mail is judged to be spam; otherwise it is judged to be normal mail. In the present invention λ is 1;
(3-4) Use the confidence factor δ to determine whether the mail is high-confidence spam; if so, go to step (3-5), otherwise go to step (3-9). Specifically, a parameter δ much larger than the cost-sensitive factor λ is set; if cost > δ, the classification result has very high confidence and the mail is judged to be high-confidence spam. In the present invention δ is 100;
(3-5) Delete the sender of the mail from graphMap, then go to step (3-9);
(3-6) Using the naive Bayes classifier result and the confidence factor, determine whether the mail is high-confidence normal mail; if so, go to step (3-7), otherwise go to step (3-8). Specifically, if cost < 1/δ, the mail is judged to be high-confidence normal mail;
(3-7) Add the recipient's identifier ID to the user white list, then go to step (3-8);
(3-8) Add the relationship between the sender and the recipient of the mail to graphMap;
(3-9) Determine whether there is more new mail to classify; if so, return to step (3-1), otherwise end.
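The decision flow of sub-steps (3-1) to (3-8) for a single mail can be sketched as below. The sketch assumes the caller supplies `cost` (the spam-to-normal probability ratio) from a classifier that is not shown, and it models "deleting the sender from graphMap" as removing the sender's key; whether edges pointing to that sender are also removed is not specified in the text. The function name is hypothetical.

```python
LAMBDA = 1.0    # cost-sensitive factor, value given in the text
DELTA = 100.0   # confidence factor, value given in the text


def process_mail(sender, recipients, cost, graph_map, whitelist):
    """Classify one mail and update graph_map / whitelist in place.
    sender and recipients are serial numbers; returns 'spam' or 'normal'."""
    if sender in whitelist:                                   # (3-1) -> (3-2)
        graph_map.setdefault(sender, set()).update(recipients)
        whitelist.update(recipients)
        return "normal"
    if cost > LAMBDA:                                         # (3-3): judged spam
        if cost > DELTA:                                      # (3-4) -> (3-5)
            graph_map.pop(sender, None)
        return "spam"
    if cost < 1 / DELTA:                                      # (3-6) -> (3-7)
        whitelist.update(recipients)
    graph_map.setdefault(sender, set()).update(recipients)    # (3-8)
    return "normal"


graph_map, whitelist = {7: {8}}, {1}
results = [
    process_mail(1, [2], 5.0, graph_map, whitelist),      # white-listed sender
    process_mail(7, [8], 1000.0, graph_map, whitelist),   # high-confidence spam
    process_mail(3, [4], 0.001, graph_map, whitelist),    # high-confidence normal
    process_mail(5, [6], 0.5, graph_map, whitelist),      # ordinary normal
]
```

Only the two high-confidence outcomes change trust state aggressively: cost > δ removes the sender from the graph, and cost < 1/δ promotes the recipients to the white list.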
To guard against abnormal users appearing in the user white list, the users in the white list can be checked. When a mail from a white-listed user is classified, with a small probability p it is not directly judged to be normal mail but is instead passed to the Bayes classifier. If the classifier judges it to be normal mail, the user is considered a normal user; if the classifier judges it to be spam, the user is deleted from the user white list. This prevents spam senders from slipping into the white list and degrading the system's classification results.
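The random spot check can be sketched as follows. The value p = 0.05 is purely illustrative (the text only says p is small), the `cost` value is assumed to come from the classifier, and the injectable `rng` parameter is a testing convenience rather than anything described in the source.

```python
import random


def audit_whitelisted_sender(sender, cost, whitelist, p=0.05, lam=1.0, rng=random):
    """With probability p, re-check a white-listed sender's mail with the classifier.
    Removes the sender from the white list if the sampled mail is judged spam.
    Returns True if the sender is white-listed after the call."""
    if sender not in whitelist:
        return False
    if rng.random() < p and cost > lam:   # sampled for checking and judged spam
        whitelist.discard(sender)
        return False
    return True


class FixedRng:
    """Deterministic stand-in for the random module, used for the example only."""
    def __init__(self, value):
        self.value = value

    def random(self):
        return self.value


whitelist = {1, 2}
removed = audit_whitelisted_sender(1, 5.0, whitelist, rng=FixedRng(0.0))
kept = audit_whitelisted_sender(2, 5.0, whitelist, rng=FixedRng(0.99))
```

Because most mail from white-listed senders skips the classifier, the check adds little cost while still giving every white-listed sender some chance of being re-evaluated.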
As those skilled in the art will readily understand, the foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. A mail classification method combining customer relationships and Bayesian theory, characterized by comprising the following steps:
(1) obtaining mail samples and training a naive Bayes classifier on them, the training samples being divided into normal mail and spam, and building the customer relationship graph graphMap from the contact relationships in the normal mail of the training samples;
(2) extracting a user white list from the customer relationship graph graphMap built in step (1), the user white list being initially empty;
(3) classifying new mail using the customer relationship graph and the trained naive Bayes classifier from step (1) together with the user white list extracted in step (2), and updating the customer relationship graph and the user white list according to the result.
2. The mail classification method according to claim 1, characterized in that building the customer relationship graph graphMap comprises the following sub-steps:
(1-1) reading one mail from the mail samples and determining whether it is normal mail; if so, going to step (1-2), otherwise going to step (1-9);
(1-2) determining whether the sender of the mail exists in the mapping idMap; if not, going to step (1-3), otherwise going to step (1-4);
(1-3) adding the sender's account to idMap and updating the id counter, then going to step (1-4);
(1-4) determining whether the recipient of the mail exists in idMap; if not, going to step (1-5), otherwise going to step (1-6);
(1-5) adding the recipient's account to idMap, updating the id counter, adding the recipient's identifier ID to the recipient list set Set, and going to step (1-6);
(1-6) determining whether there are other recipients; if so, going to step (1-4); otherwise determining whether the sender's identifier ID is in the customer relationship graph graphMap; if not, going to step (1-7), otherwise going to step (1-8);
(1-7) adding the sender's identifier ID and the recipient list set Set to graphMap as a key-value pair, then going to step (1-9);
(1-8) updating the recipient list set Set associated with the sender's identifier ID, and going to step (1-7);
(1-9) determining whether there are other mails in the mail samples; if so, returning to step (1-1), otherwise ending the process.
3. The mail classification method according to claim 2, characterized in that step (2) comprises the following sub-steps:
(2-1) reading one key-value pair from the customer relationship graph graphMap and adding the recipient identifiers ID in its recipient list set Set to the user white list;
(2-2) determining whether unread key-value pairs remain in graphMap; if so, going to step (2-1), otherwise ending the process.
4. The mail classification method according to claim 3, characterized in that step (3) comprises the following sub-steps:
(3-1) reading a new mail and determining whether its sender is in the user white list; if so, the mail is normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) adding the relationship between the sender and the recipient to the customer relationship graph graphMap, adding the recipient to the user white list, and going to step (3-9);
(3-3) classifying the mail with the naive Bayes classifier and determining whether the result is spam; if so, going to step (3-4), otherwise going to step (3-6);
(3-4) using the confidence factor δ to determine whether the mail is high-confidence spam; if so, going to step (3-5), otherwise going to step (3-9);
(3-5) deleting the sender of the mail from the customer relationship graph graphMap, then going to step (3-9);
(3-6) using the naive Bayes classifier result and the confidence factor to determine whether the mail is high-confidence normal mail; if so, going to step (3-7), otherwise going to step (3-8);
(3-7) adding the recipient's identifier ID to the user white list, then going to step (3-8);
(3-8) adding the relationship between the sender and the recipient of the mail to the customer relationship graph graphMap, and adding the recipient to the user white list;
(3-9) determining whether there is more new mail to classify; if so, returning to step (3-1), otherwise ending.
5. The mail classification method according to claim 4, characterized in that, in step (3-3), if cost, the ratio of the probability that the mail is spam to the probability that it is normal mail, exceeds the cost-sensitive factor λ, the mail is judged to be spam; otherwise it is judged to be normal mail.
6. The mail classification method according to claim 5, characterized in that, in step (3-4), if cost > δ, the classification result has very high confidence and the mail is judged to be high-confidence spam, where δ is a parameter much larger than the cost-sensitive factor λ.
CN201510779256.1A 2015-11-13 2015-11-13 A kind of process for sorting mailings of combination customer relationship and bayesian theory Active CN106713108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510779256.1A CN106713108B (en) 2015-11-13 2015-11-13 A kind of process for sorting mailings of combination customer relationship and bayesian theory

Publications (2)

Publication Number Publication Date
CN106713108A CN106713108A (en) 2017-05-24
CN106713108B true CN106713108B (en) 2019-08-13

Family

ID=58930716

Country Status (1)

Country Link
CN (1) CN106713108B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant