CN106713108A - Mail classification method combining user relationships with Bayes theory - Google Patents


Info

Publication number
CN106713108A
Authority
CN
China
Prior art keywords
mail
transferred
user
customer relationship
white list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510779256.1A
Other languages
Chinese (zh)
Other versions
CN106713108B (en)
Inventor
周可
王桦
刘庆
沈慧羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201510779256.1A
Publication of CN106713108A
Application granted
Publication of CN106713108B
Legal status: Active
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21: Monitoring or handling of messages
    • H04L51/212: Monitoring or handling of messages using filtering or selective blocking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/10: Office automation; Time management
    • G06Q10/107: Computer-aided management of electronic mailing [e-mailing]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227: Filtering policies
    • H04L63/0245: Filtering by information in the payload
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441: Countermeasures against malicious traffic
    • H04L63/1458: Denial of Service

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Human Resources & Organizations (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a mail classification method that combines user relationships with Bayes theory. The method extracts the user relationships contained in mails to construct a user relationship graph and combines the graph with the Naive Bayes method to classify e-mail automatically, which improves the accuracy of the classification system and reduces the misjudgment rate. A confidence factor is proposed to estimate the credibility of the classification results of the Naive Bayes classifier. The user relationships contained in normal mails are used to construct the user relationship graph, and a user white list is generated according to users' general mail-handling habits. While new mails are classified, the classification results are continuously fed back to the user relationship graph and the user white list is updated accordingly, so the classification system automatically adjusts both to changes in incoming mail and thereby achieves higher accuracy.

Description

A mail classification method combining user relationships with Bayesian theory
Technical field
The invention belongs to the technical field of data mining and, more particularly, relates to a mail classification method combining user relationships with Bayesian theory.
Background
With the rapid development of the Internet, daily life has merged with the network environment. More and more people use the Internet for work, shopping, consumption, and entertainment, and e-mail has become one of the most important everyday means of communication. According to the 35th Statistical Report on Internet Development in China, released by the China Internet Network Information Center (CNNIC) in February 2015, China had more than 649 million Internet users by December 2014, of whom more than 251 million used e-mail. Abroad, there were about 929 million business e-mail accounts in 2013, and the number is still growing. Problems have followed, however: large amounts of spam flood people's lives and work, and spam is rampant on the network. Existing data show that spam accounted for only 36% of all mail sent over the Internet in 2002, rose to 80% by 2006, and exceeded 95% by 2010.
Such heavy use of e-mail and such a large user base make work and life more convenient, but they also give a platform to people with malicious intent. Much of the mail that arrives in most people's mailboxes is spam. Spam not only disrupts daily life and occupies mailbox space; handling it wastes people's energy, puts heavy pressure on mail servers, and consumes considerable network resources. Spam consists mainly of promotional advertisements, training publicity, and system push notifications; other spam contains reactionary, pornographic, or gambling content. Ordinary users have to spend time and energy dealing with all of it. Some spam even carries viruses and seriously threatens the security of the user's machine and private information.
Mail classification technology has developed for more than a decade, and many techniques are already in practical use. The main spam classification techniques currently used in China and abroad are the following:
1) Keywords
Keyword-based spam classification is implemented mainly by building a library of sensitive keywords. The library contains most of the sensitive words likely to appear in spam, such as "discount", "promotion", and "panic buying". When a mail contains a word from the keyword library, it is often spam. To improve accuracy further, many systems use keyword scoring: each word in the mail that appears in the keyword library adds 1 to the mail's score, and the mail is judged to be spam once the total score exceeds a set threshold. This method was widely applied in early anti-spam technology because it is simple to implement and fast. However, as Internet technology has developed, spam has proliferated and its forms keep changing. To maintain classification accuracy, the keyword library must be updated frequently, and this maintenance requires specialists. In addition, the technique has proved in practice to have too high a misjudgment rate and too many limitations.
2) Blacklists, white lists, and grey lists
These techniques operate mainly on mail contacts, IP addresses, DNS entries, or domain names, and classify mail by building the corresponding contact, IP address, DNS, or domain name lists. Take IP-address-based lists as an example. A blacklist contains the IP addresses of all known spam senders; when a mail arrives, the blacklist is checked, and the mail is judged to be spam if its IP address is on the list. A white list works the opposite way: if the IP address is on the white list, the mail is not spam. With a grey list, the mail server records the mail header the first time a user sends mail and requires the sender to resend within the time window specified by the grey list before the mail can pass. However, grey lists can require repeated transmission, which adds network bandwidth overhead and server load. Blacklists and white lists rely on rigid decision rules and easily cause misjudgments, and a complete list is difficult to build in practice, so they are generally used only as a supplement to a classification system.
3) Fingerprint recognition
Fingerprint recognition generates fingerprint information from the content of each mail. To classify a mail, its fingerprint is submitted to a global server, which maintains a fingerprint database and decides whether the mail is spam according to the number of times the same fingerprint has been reported as spam. However, the fingerprint database requires frequent maintenance, and a spam mail must spread widely and be submitted to the global server many times before recognition accuracy becomes high.
4) KNN
The KNN (K-Nearest-Neighbor) method classifies a mail according to the class to which the majority of its K most similar samples in the feature space belong. Each classification compares the mail with every sample in the sample space, selects the K samples with the highest similarity, and uses majority voting among them to decide the class of the mail. However, KNN is computationally expensive and suits only small sample sets, and different sample sets differ in their sensitivity to K, so the value of K must be chosen very carefully.
5) Bayes
The Bayes method computes the posterior probability of a mail from the prior probabilities of the mail samples. Using statistical probability calculations, it computes the probability that the mail belongs to each class and selects the class with the highest probability as the class of the mail. The Bayes method first trains a classifier on training samples and then uses the classifier to classify other mails. However, the performance of a Bayes classifier depends heavily on the training process: the size and quality of the training set strongly affect the final classification performance, and the classifier cannot be changed once training is complete, so it adapts poorly to dynamic changes in mail.
6) SVM
The support vector machine (SVM) method constructs an optimal linear separating surface from training samples; the optimal surface guarantees the maximum classification margin. The method is well suited to small sample sets and to high-dimensional pattern recognition, where it works well.
7) Community-based methods
Community-based classification clusters mails into several communities, that is, classes, according to similarity. To decide the community of a new mail, one approach computes its similarity to each community and selects the community with the highest similarity. Another approach computes whether the center similarity of a community increases after the mail is added, and if so adds the new mail to that community. This method first assumes that mails can be divided into several communities by content, and it must compute the similarity between the mail and all samples in a community, so it suits only small-scale mail classification.
8) Methods based on social relationships and URLs
The method based on social relationships and URLs (UNIK) is used specifically to classify mails that contain URLs. It builds a relationship graph from the contacts on both sides of each mail and the URLs they share, identifies all normal users in the graph, and progressively prunes the graph; the nodes that remain are spam senders. The method requires that spam spread more widely than normal mail, that is, be sent to large numbers of people, and it can classify only mails that contain URLs, so it has certain limitations.
Besides the methods listed above, mail classification also uses other methods, such as decision-tree-based methods, Boosting methods, behavior-pattern-based methods, protocol-based classification, and image-recognition-based methods.
However, these existing mail classification methods have the following problems:
1. Anti-spam classification techniques are inefficient at handling spam whose format and content change;
2. The user relationships contained in mails are ignored;
3. The misjudgment rate for normal mail, that is, the probability that normal mail is judged to be spam, is high.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a mail classification method combining user relationships with Bayesian theory. It aims to solve the technical problems of low classification accuracy and a high misjudgment rate for normal mail that arise because existing classification methods do not consider user relationships. By combining the user relationships contained in mails with the Naive Bayes method, it improves the accuracy of mail classification and reduces the misjudgment rate.
To achieve the above object, according to one aspect of the invention, a mail classification method combining user relationships with Bayesian theory is provided, comprising the following steps:
(1) Obtain mail samples and train a Naive Bayes classifier on them, where the training samples are divided into normal mail and spam, and build the user relationship graph graphMap from the contact relationships in the normal mails of the training samples;
(2) Extract the user white list from the user relationship graph graphMap built in step (1), where the user white list is initially empty;
(3) Judge each new mail using the user relationship graph built in step (1), the trained Naive Bayes classifier, and the user white list extracted in step (2), and update the user relationship graph and the user white list according to the result.
Preferably, building the user relationship graph graphMap comprises the following sub-steps:
(1-1) Read one mail from the mail samples and judge whether it is a normal mail; if so, go to step (1-2), otherwise go to step (1-9);
(1-2) Judge whether the sender of the mail exists in the mapping idMap; if not, go to step (1-3), otherwise go to step (1-4);
(1-3) Add the sender's account to idMap and update the id counter, then go to step (1-4);
(1-4) Judge whether the recipient of the mail exists in idMap; if not, go to step (1-5), otherwise go to step (1-6);
(1-5) Add the recipient's account to idMap, update the id counter, and add the recipient's identifier to the recipient set Set, then go to step (1-6);
(1-6) Judge whether the sender's identifier is in the user relationship graph graphMap; if not, go to step (1-7), otherwise go to step (1-8);
(1-7) Add the sender id and the recipient set Set to the user relationship graph graphMap as a key-value pair, then go to step (1-9);
(1-8) Update the value stored for the sender with the recipient set Set, then go to step (1-9);
(1-9) Judge whether other mails remain in the mail samples; if so, return to step (1-1), otherwise the process ends.
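As an illustration only, sub-steps (1-1) to (1-9) can be sketched in Python. The function name, the tuple layout of the samples, and the choice to record every recipient of a normal mail in the sender's set are assumptions made for this sketch; only the names idMap and graphMap come from the text.

```python
def build_relationship_graph(samples):
    """Build (id_map, graph_map) from (sender, recipients, is_normal) tuples.

    The tuple layout is a hypothetical input format, not defined in the patent.
    """
    id_map = {}      # account -> integer id (idMap in the text)
    graph_map = {}   # sender id -> set of recipient ids (graphMap in the text)

    def get_id(account):
        # Assign the next counter value the first time an account is seen.
        if account not in id_map:
            id_map[account] = len(id_map)
        return id_map[account]

    for sender, recipients, is_normal in samples:
        if not is_normal:
            continue                                      # step (1-1): skip spam
        sender_id = get_id(sender)                        # steps (1-2), (1-3)
        recipient_ids = {get_id(r) for r in recipients}   # steps (1-4), (1-5)
        # Steps (1-6) to (1-8): create the sender's entry or extend it.
        graph_map.setdefault(sender_id, set()).update(recipient_ids)
    return id_map, graph_map
```

Using a dictionary of sets keeps the "create or update" branch of steps (1-6) to (1-8) in a single call.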
Preferably, step (2) comprises the following sub-steps:
(2-1) Read a key-value pair from the user relationship graph graphMap and add the recipient identifiers in its recipient set Set to the user white list;
(2-2) Judge whether any unread key-value pairs remain in graphMap; if so, go to step (2-1), otherwise the process ends.
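A minimal Python sketch of sub-steps (2-1) and (2-2) follows. It assumes graphMap is held as a dictionary mapping each sender id to a set of recipient ids; the function name is hypothetical.

```python
def extract_whitelist(graph_map):
    """Collect every recipient id found in graph_map into a white list."""
    whitelist = set()                      # the white list starts empty
    for _sender_id, recipient_ids in graph_map.items():
        whitelist.update(recipient_ids)    # step (2-1)
    return whitelist                       # loop exhaustion is step (2-2)
```

For example, `extract_whitelist({0: {1, 2}, 3: {2, 4}})` collects the ids 1, 2, and 4.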
Preferably, step (3) comprises the following sub-steps:
(3-1) Read a new mail and judge whether its sender is in the user white list; if so, the mail is a normal mail, go to step (3-2); otherwise go to step (3-3);
(3-2) Add the user relationship between the sender and recipient to the user relationship graph graphMap, add the recipient to the user white list, and go to step (3-9);
(3-3) Classify the mail with the Naive Bayes classifier and judge whether the result is spam; if so, go to step (3-4), otherwise go to step (3-6);
(3-4) Use the confidence factor δ to judge whether the mail is high-confidence spam; if so, go to step (3-5), otherwise go to step (3-9);
(3-5) Delete the sender of the mail from the user relationship graph graphMap, then go to step (3-9);
(3-6) Use the Naive Bayes classifier and the confidence factor to judge whether the mail is a high-confidence normal mail; if so, go to step (3-7), otherwise go to step (3-8);
(3-7) Add the recipient's identifier to the user white list, then go to step (3-8);
(3-8) Add the user relationship between the sender and recipient of the mail to the user relationship graph graphMap;
(3-9) Judge whether new mails remain to be classified; if so, return to step (3-1), otherwise end.
Preferably, step (3-3) is specifically: let cost be the ratio of the probability that the mail is spam to the probability that it is normal mail; if cost > λ, where λ is the cost-sensitive factor, the mail is judged to be spam, otherwise it is judged to be normal mail.
Preferably, step (3-4) is specifically: if cost > δ, the classification result has high confidence and the mail is judged to be high-confidence spam, where the parameter δ is much larger than the cost-sensitive factor λ.
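The per-mail flow of steps (3-1) to (3-8) can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: accounts are used directly as identifiers, cost is supplied by some Naive Bayes classifier that is not shown, the high-confidence normal test cost < 1/δ is taken from the detailed description, and the default values of `lam` and `delta` are placeholders.

```python
def classify_and_update(sender, recipients, cost, graph_map, whitelist,
                        lam=1.0, delta=100.0):
    """Return 'normal' or 'spam'; update graph_map and whitelist in place.

    cost is P(Spam|x) / P(Normal|x); lam and delta defaults are placeholders.
    """
    if sender in whitelist:                                       # (3-1)
        graph_map.setdefault(sender, set()).update(recipients)    # (3-2)
        whitelist.update(recipients)
        return "normal"
    if cost > lam:                                                # (3-3): spam
        if cost > delta:                                          # (3-4)
            graph_map.pop(sender, None)                           # (3-5)
        return "spam"
    if cost < 1.0 / delta:                                        # (3-6)
        whitelist.update(recipients)                              # (3-7)
    graph_map.setdefault(sender, set()).update(recipients)        # (3-8)
    return "normal"
```

Calling the function once per incoming mail plays the role of the loop in step (3-9).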
In general, compared with the prior art, the above technical scheme conceived by the invention achieves the following beneficial effects:
1. It addresses the inefficiency of existing anti-spam classification techniques in handling spam whose format and content change: because the invention uses the Naive Bayes classifier of step (1), the classifier can be trained to learn new spam and can therefore handle such spam more efficiently;
2. It addresses the neglect of user relationships in existing methods: because the invention builds the user relationship graph in step (1) and combines the user relationships contained in mails with the Naive Bayes classifier, these relationships are taken into account;
3. It addresses the high misjudgment rate for normal mail in existing methods: because the invention extracts the user white list in step (2), high-confidence normal mails are picked out in advance and bypass the Naive Bayes classifier, which reduces the probability that normal mail is misjudged as spam;
4. The invention filters out high-confidence normal mail through the white list, which reduces the number of mails the Naive Bayes classifier has to judge and improves classification efficiency.
Brief description of the drawings
Fig. 1 is the mail classification flow chart.
Fig. 2 is the flow for building the user relationship graph.
Fig. 3 is the flow for building the initial user white list.
Fig. 4 is the flow chart of the mail classification method of the invention combining user relationships with Bayesian theory.
Detailed description
To make the objects, technical schemes, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.
Given a sample instance x = {x1, x2, ..., xm}, Naive Bayes decides which class the instance belongs to by computing the posterior probability that it belongs to each class and selecting the class with the highest probability. The invention classifies mail into two classes, normal mail (Normal) and spam (Spam), so the Naive Bayes computation only needs to compare P(Spam | x) with P(Normal | x) and select the class with the larger probability as the final class of the mail. After the cost-sensitive factor (denoted λ) is introduced, the system judges a mail to be spam only when P(Spam | x) and P(Normal | x) satisfy the following relation:
$$\frac{P(\mathrm{Spam}\mid x)}{P(\mathrm{Normal}\mid x)} > \lambda$$
Otherwise the mail is treated as normal mail.
On the basis of considering misjudgment cost, the invention proposes a confidence factor (denoted δ) to estimate the confidence of the two-class classification result without changing the classification result of the Naive Bayes classifier. Normally the confidence factor δ is much larger than the cost-sensitive factor λ. With the confidence factor, the confidence of the Naive Bayes classification results can be measured: high-confidence normal mails are selected to build and update the user relationship graph and the white list, and high-confidence spam is excluded from both, so that the users in the white list are confidently normal users.
Specifically, when the Naive Bayes classifier judges a mail to be spam and the following relation holds:
$$\frac{P(\mathrm{Spam}\mid x)}{P(\mathrm{Normal}\mid x)} > \delta$$
the classification result is said to have high confidence, and the mail is called confident spam. Likewise, when the Naive Bayes classifier judges a mail to be normal and the following relation holds:
$$\frac{P(\mathrm{Normal}\mid x)}{P(\mathrm{Spam}\mid x)} > \delta$$
the mail is called a confident normal mail.
With the confidence factor δ set, let cost denote the ratio $P(\mathrm{Spam}\mid x)/P(\mathrm{Normal}\mid x)$. When cost > λ, the mail is judged to be spam; if in addition cost > δ, the classification result has high confidence and the mail is called confident spam. When cost ≤ λ, the mail is treated as normal mail; if it also satisfies cost < 1/δ, the mail is confidently normal. When cost lies between λ and δ (or between 1/δ and λ), the mail is judged only to be ordinary spam (or ordinary normal mail), and the result is not treated as high-confidence.
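The four bands described above can be written as a small decision function. This is a hedged sketch: the function name is hypothetical, and the check that 1/δ < λ < δ is an added assumption that keeps the bands from overlapping; the text only states that δ is much larger than λ.

```python
def label_with_confidence(cost, lam, delta):
    """Map cost = P(Spam|x) / P(Normal|x) to (label, is_confident)."""
    # Assumed sanity check, not stated in the patent text.
    if delta <= 0 or not (1.0 / delta < lam < delta):
        raise ValueError("expected 1/delta < lam < delta")
    if cost > lam:
        return "spam", cost > delta           # confident spam when cost > delta
    return "normal", cost < 1.0 / delta       # confident normal when cost < 1/delta
```

With λ = 1 and δ = 100, a cost of 5 yields ordinary spam, while a cost of 0.001 yields a confident normal mail.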
After the confidence factor δ is introduced, setting δ to a relatively large value makes it possible to identify the high-confidence classification results, which plays a very important role in the invention's strategy of combining user relationships with Bayesian theory.
Every e-mail contains information such as sender, recipients, title, body, and time. There is only one sender, but there may be several recipients, and the user relationships between mail contacts are one of the key features of mail. Mail comes from people's real lives and reflects, to some extent, their behavioral habits and preferences. In real life, when people receive irrelevant spam, they may mark it as spam and move it to the trash, delete it directly, or simply leave it in the mailbox; the vast majority, however, never reply to spam. This indicates that if a mail is judged to be normal, its recipient is generally a normal user. Further, if two contacts exchange normal mails, both are generally normal users. Considering these everyday behavioral habits, the invention proposes the following: first, if two contacts are connected by normal mail, both contacts are normal users; second, if a mail is a confident normal mail, its recipients are normal users. Whether a mail is a confident normal mail can be judged with the confidence factor proposed by the invention.
Unlike existing methods of building a user relationship graph, the invention does not use all mails to build a complete user relationship graph but uses only the normal mails. It builds the user relationship graph of normal mail from the contact information of all normal mails, then extracts the high-confidence normal contacts into the user white list (Whitelist), so that the users in the white list are ultimately highly trustworthy normal users.
Fig. 1 shows the classification process of the proposed mail classification method combining user relationships with Bayesian theory. Before a new mail is classified, the Bayes classifier must first be trained, the initial user relationship graph built from the training samples, and the user white list extracted from it. After training, each new incoming mail is classified as follows. The user white list is searched for the sender of the mail, and mail sent by a user in the white list is classified as normal mail without exception. If the sender is not found in the user white list, the mail is handed to the Bayes classifier, the classification result is fed back to the user relationship graph, and the user white list is updated.
The invention describes a mail classification method combining user relationships with Bayesian theory: it builds a user relationship graph from the user relationships contained in mails and combines it with the Naive Bayes method, improving the accuracy of mail classification and reducing the misjudgment rate.
Bayes classification needs a training sample set that contains both classes of mail, represented in a machine-readable digital form. The invention represents mail content with the vector space model (VSM). Each feature item corresponds to one word in the vocabulary, and to simplify the mail model the weight of each feature item is only 0 or 1. A weight of 0 means the feature does not appear in the mail; a weight of 1 means it does. The final representation of the mail content is thus a multidimensional vector whose components take only the values 0 and 1.
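The binary representation can be illustrated with a short Python sketch. The function name and the sample vocabulary are hypothetical; the vocabulary is assumed to be an ordered list so that vector positions are stable.

```python
def to_binary_vector(words, vocabulary):
    """Return a 0/1 vector with one component per vocabulary word."""
    present = set(words)   # repeated words count once, since weights are 0 or 1
    return [1 if term in present else 0 for term in vocabulary]
```

For example, with the vocabulary `["card", "discount", "send"]`, a mail whose words are `["send", "card", "card"]` maps to `[1, 0, 1]`.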
To convert a mail into its VSM vector, the mail must first be segmented into words. Many segmentation methods exist; common ones include dictionary-based methods, forward maximum matching, reverse maximum matching, bidirectional maximum matching, word-frequency statistics, minimum segmentation, and statistics-based segmentation. The invention uses the NLPIR segmentation tool to segment the mail title and body. NLPIR, also known as ICTCLAS, is a Chinese word segmentation system built by Dr. Zhang Huaping's team at the Institute of Computing Technology, Chinese Academy of Sciences. It is based on a multilayer hidden Markov model (HMM) and supports Chinese segmentation, English segmentation, mixed Chinese, English, and numeral segmentation, and part-of-speech tagging. Suppose a mail is titled "Zhang San sent a 《New Year's Day greeting card》". Segmenting it with NLPIR using the first-level part-of-speech tag set gives "Zhang/q San/m send/v come/v [particle]/u 《/w New Year's Day/t greeting card/n 》/w". With part-of-speech tags available, words that have little influence on the text content can be filtered out easily; the invention retains only verbs (/v), nouns (/n), and adjectives (/a).
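The part-of-speech filter can be sketched without calling NLPIR itself. The sketch below assumes the segmenter's output is already available as whitespace-separated `word/tag` tokens with single-letter first-level tags; that input format and the function name are assumptions, and no NLPIR API is used.

```python
KEPT_TAGS = ("v", "n", "a")   # verbs, nouns, adjectives

def filter_by_pos(tagged_text):
    """Keep the words whose tag is /v, /n, or /a."""
    kept = []
    for token in tagged_text.split():
        # Split on the last slash so a slash inside the word is preserved.
        word, sep, tag = token.rpartition("/")
        if sep and word and tag in KEPT_TAGS:
            kept.append(word)
    return kept
```

Applied to a glossed token string such as `"Zhang/q San/m send/v come/v de/u New-Year/t card/n"`, the filter keeps `send`, `come`, and `card`.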
After the e-mails in the network audit system have been processed with the segmentation tool, the title and body of each mail become a sequence of individual words, and these words constitute the features of the mail. To classify mail, a single unified feature dictionary is needed to represent all mails; this dictionary is called the vocabulary (WordTable). If every distinct word contained in all mails were added to the vocabulary, the result would usually be very large and, depending on the size of the training set, could contain hundreds of thousands of words. An oversized vocabulary both hampers the representation of mails and increases the computational overhead of the system. To reduce the interference of words unrelated to mail features with the classification system, the vocabulary must be pruned. Pruning by part-of-speech tag set reduces the vocabulary considerably, but the vocabulary formed by the remaining word classes is still very large and needs further processing, namely feature selection (Feature Selection).
The present invention performs feature selection with the χ² statistic. The χ² statistic measures the degree to which a feature and a class deviate from independence: when feature Xi and class Ck are strongly correlated (weakly independent), the χ² value is large; when they are strongly independent (weakly correlated), the χ² value is small. Feature selection with the χ² statistic generally computes the χ² value of every feature and then selects the specified number of features with the largest values.
The present invention uses two mail classes: spam (Spam) and normal mail (Normal). Computing the χ² value of feature Xi requires counting the number of times Xi occurs in each class, as shown in Table 1:
Table 1: Occurrence counts of feature Xi in the spam and normal mail classes
In Table 1, A, B, C, and D denote the occurrence counts of feature Xi in each class. Let S denote the total number of mails (S = A + B + C + D); the χ² value of feature Xi is then computed as shown in Expression 1:
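Table 1 and Expression 1 are not reproduced in this text. As a minimal sketch, assume that Expression 1 is the standard χ² statistic for a 2×2 contingency table, χ² = S(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)), and assume (the source does not confirm this layout) that A and B count the spam and normal mails containing Xi while C and D count the spam and normal mails that do not contain it. The function name `chi_square` is hypothetical.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard 2x2 chi-square statistic for one feature.

    Assumed cell layout (not confirmed by the source):
      a = spam mails containing the feature
      b = normal mails containing the feature
      c = spam mails not containing the feature
      d = normal mails not containing the feature
    """
    s = a + b + c + d  # total number of mails
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero marginal means the feature or class never varies,
        # so the feature carries no information; score it as 0.
        return 0.0
    return s * (a * d - b * c) ** 2 / denom


perfect = chi_square(10, 0, 0, 10)   # feature occurs only in spam
useless = chi_square(5, 5, 5, 5)     # feature independent of class
```

A feature that appears only in one class scores high, while a feature spread evenly over both classes scores 0, which matches the behaviour described for the χ² statistic above.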
The χ² value of every feature is computed, and the k features with the largest values are selected. These k features are then used to quantize each mail: for any mail, the k features are checked in turn, and if a feature occurs in the mail, the element at the corresponding position of the mail's vector is set to 1; otherwise it is set to 0.
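The selection and quantization steps can be sketched as below, assuming the χ² scores have already been computed and are held in a dictionary keyed by word. The names `select_top_k` and `vectorize`, and the sample scores, are illustrative only.

```python
from collections.abc import Iterable


def select_top_k(scores: dict[str, float], k: int) -> list[str]:
    """Return the k words with the largest scores (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def vectorize(mail_words: Iterable[str], features: list[str]) -> list[int]:
    """Binary VSM vector: 1 if the feature occurs in the mail, else 0."""
    present = set(mail_words)
    return [1 if feature in present else 0 for feature in features]


# Illustrative scores, not taken from the source.
scores = {"free": 9.0, "meeting": 7.5, "the": 0.1, "win": 8.0}
features = select_top_k(scores, 2)
vector = vectorize(["win", "a", "prize"], features)
```

The order of `features` fixes the meaning of each vector position, so the same list must be reused for training and for classifying new mail.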
In the user relationship graph, each node represents a contact account, and a directed edge between two nodes represents mail sent between the contacts. To reduce the storage required for the user relationship graph, all contact accounts are passed through a hash mapping (idMap). Each key in the mapping is a user account and the corresponding value is the sequence number assigned to that account, so each key-value pair in idMap associates one user account with its sequence number.
The system maintains a contact account counter, whose default initial value is 0. Whenever a new contact account appears, the counter is incremented by 1.
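A minimal sketch of idMap and its counter follows. It assumes that the counter is incremented first and its new value is then used as the sequence number, so the first account receives 1; the source does not state which value is assigned. The class name `IdMap` and the sample addresses are hypothetical.

```python
class IdMap:
    """Maps each contact account to a compact integer sequence number."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}
        self._counter = 0  # default initial value, as in the description

    def get_or_add(self, account: str) -> int:
        """Return the account's sequence number, assigning one if it is new."""
        if account not in self._ids:
            self._counter += 1  # a new contact account increments the counter
            self._ids[account] = self._counter
        return self._ids[account]

    def __contains__(self, account: str) -> bool:
        return account in self._ids


id_map = IdMap()
first = id_map.get_or_add("zhangsan@example.com")
second = id_map.get_or_add("lisi@example.com")
again = id_map.get_or_add("zhangsan@example.com")
```

Storing integers instead of full account strings keeps the graph structure small, which is the stated purpose of the mapping.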
The present invention builds the user relationship graph from normal mail, with nodes connected by directed edges, and stores the graph in a mapping structure called graphMap. Key-value pairs in graphMap take the form <Integer, Set<Integer>>: each integer key is a sender sequence number, that is, the value to which idMap maps some user account, and every user sequence number in the Set is a recipient of that sender. A graphMap of this form represents the full content of the user relationship graph, and the hash mapping structure speeds up node lookups in the graph.
As shown in Fig. 4, the mail classification method of the present invention, which combines user relationships with Bayesian theory, comprises the following steps:
(1) Obtain mail samples and train a naive Bayes classifier on them, where the training samples are divided into normal mail and spam, and build the user relationship graph graphMap from the correspondence between contacts in the normal mail of the training samples;
For example, suppose user Zhang San has sent mail to both Li Si and Wang Wu, and the identifiers of Zhang San, Li Si, and Wang Wu are 10, 20, and 30 respectively. The mapping idMap is then <Zhang San, 10>, <Li Si, 20>, <Wang Wu, 30>, and the user relationship graph graphMap is <10, <20, 30>>;
Building the user relationship graph graphMap, as shown in Fig. 2, comprises the following sub-steps:
(1-1) Read one mail from the mail samples and determine whether it is normal mail; if so, go to step (1-2), otherwise go to step (1-9);
(1-2) Determine whether the sender of the mail exists in the mapping idMap; if not, go to step (1-3), otherwise go to step (1-4);
(1-3) Add the sender's account to idMap and update the id counter, then go to step (1-4); specifically, updating the id counter means incrementing its value by one;
(1-4) Determine whether the recipient of the mail exists in idMap; if not, go to step (1-5), otherwise go to step (1-6);
(1-5) Add the recipient's account to idMap, update the id counter, add the recipient's identifier to the recipient list set Set, and go to step (1-6);
(1-6) Determine whether the sender's identifier is in the user relationship graph graphMap; if not, go to step (1-7), otherwise go to step (1-8);
(1-7) Add the sender id and the recipient list set Set to graphMap as a key-value pair, then go to step (1-9);
(1-8) Update the recipient list set Set corresponding to the sender, and go to step (1-9);
(1-9) Determine whether the mail samples contain further mails; if so, return to step (1-1), otherwise the process ends.
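Sub-steps (1-1) to (1-9) can be sketched as a single loop. This is a simplified illustration under stated assumptions: each training mail is a hypothetical `(sender, recipients, is_normal)` tuple, sequence numbers start at 1, and every recipient of a normal mail is added to the sender's set regardless of whether the recipient was already in idMap.

```python
def build_graph(mails):
    """Build idMap and graphMap from labelled training mails.

    `mails` is an iterable of (sender, recipients, is_normal) tuples;
    this input format is an assumption made for the example.
    """
    id_map: dict[str, int] = {}
    graph_map: dict[int, set[int]] = {}

    def get_id(account: str) -> int:
        # Steps (1-2)/(1-3) and (1-4)/(1-5): register unseen accounts.
        if account not in id_map:
            id_map[account] = len(id_map) + 1
        return id_map[account]

    for sender, recipients, is_normal in mails:
        if not is_normal:  # step (1-1): only normal mail contributes edges
            continue
        sender_id = get_id(sender)
        recipient_ids = {get_id(r) for r in recipients}
        # Steps (1-6) to (1-8): create the sender's entry or extend it.
        graph_map.setdefault(sender_id, set()).update(recipient_ids)
    return id_map, graph_map


training = [
    ("zhangsan", ["lisi", "wangwu"], True),
    ("spammer", ["lisi"], False),
    ("zhangsan", ["lisi"], True),
]
id_map, graph_map = build_graph(training)
```

Using `setdefault` merges the "new sender" and "existing sender" branches of steps (1-7) and (1-8) into one statement.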
(2) Extract a user whitelist from the user relationship graph graphMap built in step (1), where the user whitelist is initially empty. As shown in Fig. 3, this step comprises the following sub-steps:
(2-1) Read one key-value pair from graphMap and add the recipient identifiers in its recipient list set Set to the user whitelist;
(2-2) Determine whether graphMap contains any key-value pair that has not yet been read; if so, go to step (2-1), otherwise the process ends.
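Sub-steps (2-1) and (2-2) amount to taking the union of all recipient sets, as in this short sketch. The function name `extract_whitelist` is hypothetical, and the sample graph reuses the identifiers 10, 20, and 30 from the example in step (1).

```python
def extract_whitelist(graph_map: dict[int, set[int]]) -> set[int]:
    """Collect every recipient id that appears in graphMap."""
    whitelist: set[int] = set()  # the whitelist starts empty
    for recipients in graph_map.values():  # one key-value pair per pass
        whitelist.update(recipients)
    return whitelist


whitelist = extract_whitelist({10: {20, 30}})
```

Note that, following the text, a sender enters the whitelist only if it also appears as a recipient in some other key-value pair.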
(3) Classify new mail using the user relationship graph built in step (1), the trained naive Bayes classifier, and the user whitelist extracted in step (2), and update the user relationship graph and the user whitelist according to the classification result. This step comprises the following sub-steps:
(3-1) Read a new mail and determine whether its sender is in the user whitelist; if so, the mail is normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) Add the relationship between the sender and the recipient to the user relationship graph graphMap, add the recipient to the user whitelist, and go to step (3-9);
(3-3) Classify the mail with the naive Bayes classifier and determine whether the result is spam; if so, go to step (3-4), otherwise go to step (3-6). Specifically, a cost-sensitive factor λ is set to distinguish spam from normal mail. The decision value of the naive Bayes classifier is denoted cost, the ratio of the probability that the mail is spam to the probability that it is normal mail. If cost > λ, the mail is classified as spam; otherwise it is classified as normal mail. In the present invention λ is 1;
(3-4) Use the confidence factor δ to determine whether the mail is high-confidence spam; if so, go to step (3-5), otherwise go to step (3-9). Specifically, a parameter δ much larger than the cost-sensitive factor λ is set; if cost > δ, the classification result has very high confidence and the mail is judged to be high-confidence spam. In the present invention δ is 100;
(3-5) Delete the sender of the mail from the user relationship graph graphMap, then go to step (3-9);
(3-6) Use the naive Bayes classifier and the confidence factor to determine whether the mail is high-confidence normal mail; if so, go to step (3-7), otherwise go to step (3-8). Specifically, if cost < 1/δ, the mail is judged to be high-confidence normal mail;
(3-7) Add the recipient's identifier to the user whitelist, then go to step (3-8);
(3-8) Add the relationship between the sender and the recipient of the mail to the user relationship graph graphMap;
(3-9) Determine whether there is more new mail to classify; if so, return to step (3-1), otherwise end.
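The decision logic of sub-steps (3-1) to (3-8) for a single mail can be sketched as follows. The sketch assumes the ratio `cost` has already been computed by the classifier, that sender and recipient are given as integer identifiers, and that "deleting the sender from graphMap" means removing the sender's own key; the function name `process_mail` is hypothetical. The defaults λ = 1 and δ = 100 are the values given in the description.

```python
def process_mail(sender, recipient, cost, graph_map, whitelist,
                 lam=1.0, delta=100.0):
    """Classify one mail and update graph_map and whitelist in place.

    cost = P(spam | mail) / P(normal | mail), supplied by the caller.
    Returns "spam" or "normal".
    """
    if sender in whitelist:                                   # (3-1)
        graph_map.setdefault(sender, set()).add(recipient)    # (3-2)
        whitelist.add(recipient)
        return "normal"
    if cost > lam:                                            # (3-3)
        if cost > delta:                                      # (3-4)
            # (3-5) Assumption: drop only the sender's own entry.
            graph_map.pop(sender, None)
        return "spam"
    if cost < 1.0 / delta:                                    # (3-6)
        whitelist.add(recipient)                              # (3-7)
    graph_map.setdefault(sender, set()).add(recipient)        # (3-8)
    return "normal"


graph, allowed = {5: {1}}, {1}
r1 = process_mail(1, 2, 500.0, graph, allowed)   # whitelisted sender
r2 = process_mail(5, 1, 150.0, graph, allowed)   # high-confidence spam
r3 = process_mail(6, 7, 0.001, graph, allowed)   # high-confidence normal
r4 = process_mail(8, 9, 0.5, graph, allowed)     # low-confidence normal
```

Only high-confidence results change the whitelist or remove graph entries, so a single borderline classification does not alter who is trusted.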
To prevent abnormal users from remaining in the user whitelist, the users in the whitelist can be checked. When mail from a whitelisted user is classified, then with a small probability p the mail is not directly judged to be normal mail but is instead passed to the Bayes classifier. If the classifier labels the mail as normal, the user is considered a normal user; if it labels the mail as spam, the user is deleted from the user whitelist. This prevents spam senders that have slipped into the user whitelist from affecting the classification results of the system.
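The random whitelist check can be sketched as below. The source only says that p is small, so the default `p=0.05` is an arbitrary illustrative value; the function name `audit_whitelisted` and the injectable `rng` parameter (used here to make the behaviour reproducible) are likewise assumptions.

```python
import random


def audit_whitelisted(sender, cost, whitelist, p=0.05, lam=1.0,
                      rng=random.random):
    """Handle mail from a whitelisted sender with occasional re-checking.

    With probability p the mail is judged by the classifier ratio `cost`
    instead of being accepted directly. Returns "spam" or "normal".
    """
    if rng() >= p:
        return "normal"            # usual path: trust the whitelist
    if cost > lam:
        whitelist.discard(sender)  # spam sender slipped in: remove it
        return "spam"
    return "normal"                # check passed, sender stays trusted


trusted = {1, 2}
checked = audit_whitelisted(1, 5.0, trusted, rng=lambda: 0.0)   # forced check
skipped = audit_whitelisted(2, 5.0, trusted, rng=lambda: 0.99)  # check skipped
```

Keeping p small preserves the speed benefit of the whitelist while still giving every whitelisted sender a chance of being re-examined over time.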
Those skilled in the art will readily understand that the foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. A mail classification method combining user relationships with Bayesian theory, characterized by comprising the following steps:
(1) Obtain mail samples and train a naive Bayes classifier on them, where the training samples are divided into normal mail and spam, and build the user relationship graph graphMap from the correspondence between contacts in the normal mail of the training samples;
(2) Extract a user whitelist from the user relationship graph graphMap built in step (1), where the user whitelist is initially empty;
(3) Classify new mail using the user relationship graph built in step (1), the trained naive Bayes classifier, and the user whitelist extracted in step (2), and update the user relationship graph and the user whitelist according to the classification result.
2. The mail classification method according to claim 1, characterized in that building the user relationship graph graphMap comprises the following sub-steps:
(1-1) Read one mail from the mail samples and determine whether it is normal mail; if so, go to step (1-2), otherwise go to step (1-9);
(1-2) Determine whether the sender of the mail exists in the mapping idMap; if not, go to step (1-3), otherwise go to step (1-4);
(1-3) Add the sender's account to idMap and update the id counter, then go to step (1-4);
(1-4) Determine whether the recipient of the mail exists in idMap; if not, go to step (1-5), otherwise go to step (1-6);
(1-5) Add the recipient's account to idMap, update the id counter, add the recipient's identifier to the recipient list set Set, and go to step (1-6);
(1-6) Determine whether the sender's identifier is in the user relationship graph graphMap; if not, go to step (1-7), otherwise go to step (1-8);
(1-7) Add the sender id and the recipient list set Set to graphMap as a key-value pair, then go to step (1-9);
(1-8) Update the recipient list set Set corresponding to the sender, and go to step (1-9);
(1-9) Determine whether the mail samples contain further mails; if so, return to step (1-1), otherwise the process ends.
3. The mail classification method according to claim 2, characterized in that step (2) comprises the following sub-steps:
(2-1) Read one key-value pair from the user relationship graph graphMap and add the recipient identifiers in its recipient list set Set to the user whitelist;
(2-2) Determine whether graphMap contains any key-value pair that has not yet been read; if so, go to step (2-1), otherwise the process ends.
4. The mail classification method according to claim 3, characterized in that step (3) comprises the following sub-steps:
(3-1) Read a new mail and determine whether its sender is in the user whitelist; if so, the mail is normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) Add the relationship between the sender and the recipient to the user relationship graph graphMap, add the recipient to the user whitelist, and go to step (3-9);
(3-3) Classify the mail with the naive Bayes classifier and determine whether the result is spam; if so, go to step (3-4), otherwise go to step (3-6);
(3-4) Use the confidence factor δ to determine whether the mail is high-confidence spam; if so, go to step (3-5), otherwise go to step (3-9);
(3-5) Delete the sender of the mail from the user relationship graph graphMap, then go to step (3-9);
(3-6) Use the naive Bayes classifier and the confidence factor to determine whether the mail is high-confidence normal mail; if so, go to step (3-7), otherwise go to step (3-8);
(3-7) Add the recipient's identifier to the user whitelist, then go to step (3-8);
(3-8) Add the relationship between the sender and the recipient of the mail to the user relationship graph graphMap;
(3-9) Determine whether there is more new mail to classify; if so, return to step (3-1), otherwise end.
5. The mail classification method according to claim 4, characterized in that in step (3-3), specifically, if the ratio cost of the probability that the mail is spam to the probability that it is normal mail exceeds the cost-sensitive factor λ, the mail is classified as spam; otherwise it is classified as normal mail.
6. The mail classification method according to claim 5, characterized in that in step (3-3), specifically, if cost > parameter δ, the classification result has very high confidence and the mail is judged to be high-confidence spam, where δ is a parameter much larger than the cost-sensitive factor λ.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510779256.1A CN106713108B (en) 2015-11-13 2015-11-13 Mail classification method combining user relationships with Bayesian theory

Publications (2)

Publication Number Publication Date
CN106713108A true CN106713108A (en) 2017-05-24
CN106713108B CN106713108B (en) 2019-08-13

Family

ID=58930716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510779256.1A Active CN106713108B (en) 2015-11-13 2015-11-13 Mail classification method combining user relationships with Bayesian theory

Country Status (1)

Country Link
CN (1) CN106713108B (en)

Also Published As

Publication number Publication date
CN106713108B (en) 2019-08-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant