CN106713108B - A mail classification method combining user relationships and Bayesian theory
- Publication number
- CN106713108B (application number CN201510779256.1A)
- Authority
- CN
- China
- Prior art keywords
- transferred
- customer relationship
- user
- white list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/107—Computer-aided management of electronic mailing [e-mailing]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0245—Filtering by information in the payload
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1458—Denial of Service
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Human Resources & Organizations (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Entrepreneurship & Innovation (AREA)
- Strategic Management (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Operations Research (AREA)
- Economics (AREA)
- Marketing (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a mail classification method that combines user relationships with Bayesian theory. It extracts the user relationships contained in mail to build a user relationship graph and combines the graph with an improved Naive Bayes method to classify e-mail automatically, which raises the accuracy of the classification system and lowers its misjudgment rate. The invention proposes a certainty factor to estimate the confidence of the Naive Bayes classifier's results, and combines the Naive Bayes method with the user relationship graph: the graph is built from the user relationships contained in normal mail, and a user whitelist is generated from the general habits users follow when handling mail. While new mail is being classified, the classification results are continuously fed back into the user relationship graph and the user whitelist is updated, so that the classification system adjusts the graph and the whitelist automatically as new mail changes and thereby reaches higher accuracy.
Description
Technical field
The invention belongs to the field of data mining technology and, more specifically, relates to a mail classification method combining user relationships and Bayesian theory.
Background art
Today, with the Internet developing at high speed, daily life has merged with the network environment. More and more people use the Internet for work, shopping, consumption, entertainment and other activities, and e-mail has become one of the most important means of everyday communication. According to the 35th "Statistical Report on Internet Development in China", issued by the China Internet Network Information Center (CNNIC) in February 2015, the number of Internet users in China had passed 649 million by December 2014, of whom more than 251 million were e-mail users. Abroad, there were about 929 million business e-mail accounts in 2013, and the number is still growing. Problems have followed, however: large volumes of spam flood people's lives and work, and spam is rampant on the network. Available data show that spam made up only 36% of all mail transmitted on the Internet in 2002, rose to 80% by 2006, and exceeded 95% by 2010.
E-mail's high usage and huge user base make work and life more convenient, but they also give people with bad intentions a platform. In most people's mailboxes, well over half of the received mail is spam. Spam disrupts daily life and takes up mailbox space; handling it wastes people's energy, and it also puts heavy pressure on mail servers and consumes considerable network resources. Spam consists mainly of promotional advertisements, training publicity and system-pushed notifications; the rest includes reactionary, pornographic and gambling content. For normal users, all of it costs time and effort to deal with. Some spam even carries viruses, which seriously threatens the security of users' machines and private information.
Mail classification technology has been developing for more than a decade, and many techniques are now used in practice. The main spam classification techniques currently used at home and abroad are the following:
1) Keywords
Keyword-based spam classification works by building a library of sensitive keywords. The library contains most of the sensitive words that may appear in spam, such as "discount", "promotion", "rush purchase" and "$"; a mail that contains words from the library is likely to be spam. To improve accuracy further, many people use keyword scoring: each word from the keyword library that a mail contains adds 1 to the mail's score, and the mail is judged to be spam once its total score exceeds a set threshold. This method was widely used in early anti-spam technology because it is very simple to implement and fast to run. However, as Internet technology develops, spam spreads ever more widely and its forms keep changing. To keep classification accurate, the keyword library has to be maintained and updated frequently, and this has to be done by specialists. In addition, practice has shown that this technique has too high a misjudgment rate and is too limited.
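The keyword scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample keyword library and the threshold values are assumptions, and the sketch counts every occurrence of a library word, which is one possible reading of the scoring rule.

```python
def keyword_score(words, keyword_library):
    """Add 1 to the score for every word of the mail found in the library."""
    return sum(1 for word in words if word in keyword_library)


def is_spam_by_keywords(words, keyword_library, threshold):
    """Judge the mail to be spam once its total score exceeds the threshold."""
    return keyword_score(words, keyword_library) > threshold


# Hypothetical keyword library and mail, for illustration only.
library = {"discount", "promotion", "$"}
mail_words = ["big", "discount", "promotion", "today", "discount"]
print(keyword_score(mail_words, library))
```

The sketch also makes the weakness noted above visible: a spammer only has to avoid the listed words, so the library must be updated constantly.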
2) Blacklists, whitelists and greylists
Blacklist, whitelist and greylist techniques mainly target the contacts, IP addresses, DNS records or domain names of mail, and classify mail by building the corresponding contact lists, IP address lists, DNS lists or domain name lists. Taking lists of IP addresses as an example: the blacklist technique builds a blacklist containing the IP addresses of all known spammers; when a mail arrives, the blacklist is checked, and if the IP is on the list the mail is judged to be spam. The whitelist technique is the exact opposite: if the IP is on the whitelist, the mail is not spam. With the greylist technique, the mail server records the mail header information the first time a user sends mail and requires the sender to resend within the time window defined by the greylist before the mail is let through. However, because mail may have to be sent several times, greylisting causes extra network bandwidth overhead and increases the load on the server. Blacklists and whitelists easily cause misjudgments because of their rigid decision rules, and a complete list is hard to build in practice, so they are generally used only as an auxiliary means in a classification system.
3) Fingerprint recognition
The fingerprint technique generates fingerprint information from the content of each mail. To classify a mail, its fingerprint is submitted to a global server, which maintains a fingerprint database and decides whether the mail is spam according to the number of times the same fingerprint has been reported as spam. However, this method requires the fingerprint database to be maintained regularly, and a spam message must already have spread widely and been submitted to the global server many times before a high recognition accuracy can be obtained.
4) KNN
The KNN (K-Nearest-Neighbor) method assigns a mail to the class to which the majority of its K most similar samples in the feature space belong. Each classification requires comparing the mail with every sample in the sample space; the K samples with the greatest similarity are selected, and a majority vote then takes the class to which most of these K samples belong as the class of the mail to be classified. However, KNN is computationally expensive and suits only small-scale sample classification, and different sample sets differ in their sensitivity to the value of K, so K must be chosen with great care.
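As a rough illustration of the KNN procedure just described, the following Python sketch ranks all samples by similarity and takes a majority vote among the top K. The text does not specify the similarity measure; cosine similarity is assumed here, and the function names and sample vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(x, samples, k):
    """samples is a list of (vector, label) pairs; returns the majority label of the k nearest."""
    # Every classification compares x with all samples, which is why KNN is slow.
    ranked = sorted(samples, key=lambda s: cosine(x, s[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled sample set, for illustration only.
samples = [([1, 1, 0], "spam"), ([1, 0, 0], "spam"),
           ([0, 0, 1], "normal"), ([0, 1, 1], "normal")]
print(knn_classify([1, 1, 0], samples, k=3))
```

An odd K avoids tied votes in the two-class case, which is one reason the choice of K needs care.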
5) Bayes
The Bayes method computes the posterior probability of a mail from the prior probabilities of the mail samples: using statistical probability calculations, it computes the probability that the mail belongs to each class and selects the class with the highest probability as the class of the mail. The Bayes method first needs training samples to produce a classifier, and then uses this classifier to classify other mail. However, the performance of a Bayes classifier depends heavily on the training process; the size and quality of the training samples strongly affect the final classification performance, and once the classifier is built it cannot be changed, so it adapts poorly to dynamic changes in mail.
6) SVM
The support vector machine (Support Vector Machine) method uses training samples to construct an optimal linear separating surface that guarantees the maximum margin between classes. The method is well suited to small-scale sample classification and to high-dimensional pattern recognition, where it works well.
7) Community-based methods
A community-based classification method clusters mail into several communities, i.e. classes, according to similarity. To decide which community a new mail belongs to, its similarity to each of the communities is computed, and the community with the greatest similarity is selected as the community of the mail. In addition, the center similarity of that community after the mail has been added is computed, and the new mail is added to the community only if this center similarity increases. This method first assumes that mail can be divided into several communities by content; second, assigning a mail requires computing its similarity to every sample in a community, so the method suits small-scale mail classification.
8) Methods based on social relationships and URLs
The method based on social relationships and URLs (UNIK) is designed specifically to classify mail that contains URLs. It builds a relationship graph from the contacts on both sides of each mail and the URLs the mails have in common, finds all normal users in the graph and prunes the graph step by step; the nodes that remain at the end are the spammers. This method requires spam to be propagated more widely than normal mail, i.e. sent to a large number of other people, and it can classify only mail that contains URLs, so it has certain limitations.
Besides the methods listed above, there are other methods for mail classification, such as methods based on decision trees, Boosting methods, methods based on behavior patterns, protocol-based classification methods and methods based on image recognition.
However, these existing mail classification methods have the following problems:
1. They are inefficient at handling techniques for evading spam classification, such as changing the format and content of spam;
2. They ignore the user relationships contained in mail;
3. The misjudgment rate for normal mail, i.e. the probability that normal mail is judged to be spam, is relatively high.
Summary of the invention
In view of the above defects of the prior art and the need to improve on it, the present invention provides a mail classification method combining user relationships and Bayesian theory. Its purpose is to solve two technical problems of existing classification methods that arise from failing to consider user relationships: low mail classification accuracy and a relatively high misjudgment rate for normal mail. The method combines the user relationships contained in mail with the Naive Bayes method, which improves the accuracy of mail classification and reduces the misjudgment rate.
To achieve the above object, according to one aspect of the present invention, a mail classification method combining user relationships and Bayesian theory is provided, comprising the following steps:
(1) Obtain mail samples and train a Naive Bayes classifier on them, where the training samples are divided into normal mail and spam, and build the user relationship graph graphMap from the contact relationships in the normal mail of the training samples;
(2) Extract the user whitelist from the user relationship graph graphMap built in step (1), where the user whitelist is initially empty;
(3) Classify new mail according to the user relationship graph built in step (1), the trained Naive Bayes classifier and the user whitelist extracted in step (2), and update the user relationship graph and the user whitelist according to the classification result.
Preferably, building the user relationship graph graphMap comprises the following sub-steps:
(1-1) Read a mail from the mail samples and determine whether it is normal mail; if so, go to step (1-2), otherwise go to step (1-9);
(1-2) Determine whether the sender of the mail exists in the mapping idMap; if not, go to step (1-3), otherwise go to step (1-4);
(1-3) Add the sender's account to idMap and update the id counter, then go to step (1-4);
(1-4) Determine whether the recipient of the mail exists in idMap; if not, go to step (1-5), otherwise go to step (1-6);
(1-5) Add the recipient's account to idMap, update the id counter, add the recipient's identifier to the recipient set Set, and go to step (1-6);
(1-6) Determine whether the sender's identifier is in the user relationship graph graphMap; if not, go to step (1-7), otherwise go to step (1-8);
(1-7) Add the sender id and the recipient set Set to the user relationship graph graphMap as a key-value pair, then go to step (1-9);
(1-8) Update the value associated with the sender using the recipient set Set, and go to step (1-9);
(1-9) Determine whether there is more mail in the mail samples; if so, return to step (1-1), otherwise the process ends.
Preferably, step (2) specifically comprises the following sub-steps:
(2-1) Read a key-value pair from the user relationship graph graphMap and add the recipient identifiers in its recipient set Set to the user whitelist;
(2-2) Determine whether the user relationship graph graphMap still contains unread key-value pairs; if so, go to step (2-1), otherwise the process ends.
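Sub-steps (2-1) and (2-2) amount to taking the union of all recipient sets in graphMap. A minimal Python sketch, assuming graphMap is a dictionary from sender id to a set of recipient ids (the function name and sample data are hypothetical):

```python
def extract_whitelist(graph_map):
    """Collect the recipient ids of every key-value pair into the user whitelist."""
    whitelist = set()                      # the whitelist starts out empty
    for recipient_ids in graph_map.values():
        whitelist |= recipient_ids         # (2-1): add the recipients of this pair
    return whitelist                       # (2-2): stop when no pairs are left


# Hypothetical graph, for illustration only.
print(extract_whitelist({0: {1, 2}, 1: {0}}))
```

Note that a sender enters the whitelist only if some normal mail was addressed to it, since only recipient ids are collected.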
Preferably, step (3) comprises the following sub-steps:
(3-1) Read a new mail and determine whether its sender is in the user whitelist; if so, the mail is normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) Add the user relationship between the sender and the recipient to the user relationship graph graphMap, add the recipient to the user whitelist, and go to step (3-9);
(3-3) Classify the mail with the Naive Bayes classifier and determine whether the result is spam; if so, go to step (3-4), otherwise go to step (3-6);
(3-4) Use the certainty factor δ to determine whether the mail is high-confidence spam; if so, go to step (3-5), otherwise go to step (3-9);
(3-5) Delete the sender of the mail from the user relationship graph graphMap, then go to step (3-9);
(3-6) Use the Naive Bayes classifier and the certainty factor to determine whether the mail is high-confidence normal mail; if so, go to step (3-7), otherwise go to step (3-8);
(3-7) Add the recipient's identifier to the user whitelist, then go to step (3-8);
(3-8) Add the user relationship between the sender and the recipient of the mail to the user relationship graph graphMap;
(3-9) Determine whether there is more new mail to classify; if so, return to step (3-1), otherwise end.
Preferably, step (3-3) is specifically: if the ratio cost of the probability that the mail is spam to the probability that it is normal mail is greater than the cost-sensitive factor λ, the mail is judged to be spam; otherwise it is judged to be normal mail.
Preferably, step (3-3) is specifically: if cost > δ, the classification result has very high confidence and the mail is judged to be high-confidence spam, where the parameter δ is much larger than the value of the cost-sensitive factor λ.
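The handling of one new mail in steps (3-1) to (3-8) can be sketched in Python as follows. This is an illustrative sketch only: the mail dictionaries, the function name and the `cost_of` callable (a stand-in for the trained Naive Bayes classifier that returns the ratio cost) are assumptions, accounts are used directly as graph keys for brevity, and the 1/δ threshold for high-confidence normal mail is taken from the detailed description.

```python
def process_mail(mail, graph, whitelist, cost_of, lam, delta):
    """Classify one mail and feed the result back into the graph and whitelist."""
    sender, recipients = mail["sender"], set(mail["recipients"])
    if sender in whitelist:                                  # (3-1)
        graph.setdefault(sender, set()).update(recipients)   # (3-2)
        whitelist.update(recipients)
        return "normal"
    cost = cost_of(mail)                                     # (3-3)
    if cost > lam:
        if cost > delta:                                     # (3-4) high-confidence spam
            graph.pop(sender, None)                          # (3-5) drop the sender
        return "spam"
    if cost < 1 / delta:                                     # (3-6) high-confidence normal
        whitelist.update(recipients)                         # (3-7)
    graph.setdefault(sender, set()).update(recipients)       # (3-8)
    return "normal"


# Hypothetical state and thresholds, for illustration only.
graph, whitelist = {}, {"alice"}
mail = {"sender": "alice", "recipients": ["bob"]}
print(process_mail(mail, graph, whitelist, lambda m: 1000.0, lam=9, delta=500))
```

Because whitelisted senders return before `cost_of` is called, the classifier runs only on mail from unknown senders, which is where the efficiency gain comes from.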
In general, compared with the prior art, the above technical solution conceived by the present invention can achieve the following beneficial effects:
1. It solves the problem in existing methods of low efficiency in handling techniques for evading spam classification, such as changing the format and content of spam: the present invention uses the Naive Bayes classifier of step (1), and training the classifier lets it learn new evasion techniques;
2. It solves the problem in existing methods of ignoring the user relationships contained in mail: the present invention uses the graph construction method of step (1) and combines the user relationships contained in mail with the Naive Bayes classifier;
3. It solves the problem in existing methods of a relatively high misjudgment rate for normal mail: the present invention uses the whitelist extraction method of step (2), so high-confidence normal mail is picked out in advance and does not pass through the Naive Bayes classifier, which lowers the probability that normal mail is misjudged as spam;
4. The present invention filters out high-confidence normal mail through the whitelist, which reduces the number of mails the Naive Bayes classifier has to judge and so improves the efficiency of mail classification.
Brief description of the drawings
Fig. 1 is a flow chart of mail classification.
Fig. 2 is the flow for building the user relationship graph.
Fig. 3 is the flow for building the initial user whitelist.
Fig. 4 is a flow chart of the mail classification method of the present invention that combines user relationships and Bayesian theory.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
Naive Bayes: for a given sample instance $x = \{x_1, x_2, \ldots, x_m\}$, deciding which class it belongs to requires computing the posterior probability that it belongs to each class and selecting the class with the highest probability as the class of the instance. In the present invention mail is classified into two classes, normal mail (Normal) and spam (Spam), so the Naive Bayes method only needs to compare P(Spam | x) with P(Normal | x) and select the class with the larger probability as the final class of the mail to be classified. After the cost-sensitive factor (denoted λ) is introduced, the system judges a mail to be spam only when P(Spam | x) and P(Normal | x) satisfy the following relationship:
$$\frac{P(\mathrm{Spam} \mid x)}{P(\mathrm{Normal} \mid x)} > \lambda$$
Otherwise the mail is regarded as normal mail.
Taking the cost of misjudgment into account, the present invention proposes a certainty factor (denoted δ) to estimate the confidence of the two-class classification result without changing the result given by the Naive Bayes classifier. The value of the certainty factor δ is normally much larger than the value of the cost-sensitive factor λ. Once the certainty factor is introduced, the confidence of the Naive Bayes classification results can be measured: normal mail with high confidence is selected to build and update the user relationship graph and the whitelist, and spam with high confidence is excluded from the graph and the whitelist, so that the users in the whitelist are normal users with certainty.
Specifically, when the Naive Bayes classifier has judged a mail to be spam, if the following relationship holds:
$$\frac{P(\mathrm{Spam} \mid x)}{P(\mathrm{Normal} \mid x)} > \delta$$
then the classification result is said to have high confidence, and the mail is called certain spam. Likewise, when the Naive Bayes classifier has judged a mail to be normal, if the following relationship holds:
$$\frac{P(\mathrm{Spam} \mid x)}{P(\mathrm{Normal} \mid x)} < \frac{1}{\delta}$$
then the mail is called certain normal mail.
With the certainty factor δ set, denote the ratio $\frac{P(\mathrm{Spam} \mid x)}{P(\mathrm{Normal} \mid x)}$ by cost. When cost > λ, the mail is judged to be spam; if in addition cost > δ, the result has high confidence and the mail is called certain spam. When cost ≤ λ, the mail is regarded as normal mail; if cost < 1/δ also holds, the mail is certain normal mail. When cost lies between λ and δ (or between 1/δ and λ), the mail is judged only to be ordinary spam (or ordinary normal mail), and the result is not treated as having high confidence.
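The four outcomes described above can be sketched as a small Python function. The function name and the sample values λ = 9 and δ = 500 are illustrative assumptions, not values given in the text, and P(Normal | x) is assumed to be non-zero.

```python
def judge(p_spam, p_normal, lam, delta):
    """Return (label, is_certain) from the ratio cost = P(Spam|x) / P(Normal|x)."""
    cost = p_spam / p_normal          # assumes p_normal > 0
    if cost > lam:
        return "spam", cost > delta   # certain spam only when cost also exceeds delta
    return "normal", cost < 1 / delta  # certain normal only when cost is below 1/delta


# Illustrative thresholds only; delta is chosen much larger than lam.
LAM, DELTA = 9, 500
for p_spam in (0.999, 0.95, 0.5, 0.001):
    print(p_spam, judge(p_spam, 1 - p_spam, LAM, DELTA))
```

The label depends only on λ, so adding δ never changes the Naive Bayes decision; it only marks which decisions are reliable enough to update the graph and the whitelist.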
After the certainty factor δ is introduced, setting δ to a relatively large value makes it possible to pick out the classification results with high confidence, which is very important for the classification strategy the present invention uses to combine user relationships with Bayesian theory.
Every e-mail contains information such as sender, recipients, title, body and time. There is only one sender, but there may be several recipients, and the user relationships between mail contacts are one of the important features of mail. Mail comes from people's real lives, and to some extent it reflects their behavioral habits and preferences. In real life, when people receive irrelevant spam, they may mark it as spam and move it to the trash, delete it outright, or simply leave it in the mailbox and ignore it; the vast majority of people, however, will not reply to spam. This means that if a mail is determined to be normal mail, its recipients are generally normal users. It further follows that if two contacts exchange normal mail, both are generally normal users. Considering these behavioral habits, the present invention stipulates: first, if two contacts have a normal-mail contact relationship, both contacts are normal users; second, if a mail is certain normal mail, the recipients of the mail are normal users. Whether a mail is certain normal mail can be judged with the certainty factor proposed by the present invention.
Different from the method for existing building customer relationship figure, the present invention simultaneously goes to construct complete use without using all mails
Family relational graph, but normal email therein is only used only.According to the contact information of all normal emails, normal email is constructed
Customer relationship figure, then therefrom extract the normally associate people with high confidence level and be put into user's white list (Whitelist), most
Making the user in white list eventually all is high believable normal users.
The classification process of the proposed mail classification method combining user relationships and Bayesian theory is shown in Fig. 1. Before a new mail is classified, the Bayes classifier must first be trained, the initial user relationship graph must be built from the training samples, and the user whitelist must be extracted from it. Once training is complete, a new incoming mail is classified as follows: the user whitelist is first searched for the sender of the mail, and mail sent by a user in the whitelist is always classified as normal mail; if the sender is not found in the whitelist, the mail is passed to the Bayes classifier, the classification result is fed back into the user relationship graph, and the user whitelist is updated.
The present invention describes a mail classification method combining user relationships and Bayesian theory: it builds a user relationship graph from the user relationships extracted from mail and combines the graph with the Naive Bayes method, which improves the accuracy of mail classification and reduces the misjudgment rate.
Bayes classification requires a training sample set that contains both classes of mail, with the samples represented in a recognizable numerical form. The present invention represents mail content with the vector space model (Vector Space Model, VSM): each feature item corresponds to a word in the vocabulary and, to simplify the mail model, the weight of each feature item is only 0 or 1. A weight of 0 means the feature does not occur in the mail; a weight of 1 means it does. The final representation of a mail's content is therefore a multi-dimensional vector whose components take only the values 0 and 1.
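The binary VSM representation can be sketched in Python as follows. The function name, the sample vocabulary and the sample words are hypothetical; the vocabulary is assumed to be an ordered list so that each feature keeps a fixed position in the vector.

```python
def to_binary_vector(words, word_table):
    """Map a segmented mail to a 0/1 vector over the ordered vocabulary."""
    present = set(words)  # repeated words still give weight 1, never more
    return [1 if feature in present else 0 for feature in word_table]


# Hypothetical vocabulary and segmented mail, for illustration only.
word_table = ["card", "discount", "send"]
print(to_binary_vector(["send", "card", "send"], word_table))
```

Words of the mail that are not in the vocabulary are simply ignored, so every mail maps to a vector of the same length.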
Converting a mail into its VSM vector also requires word segmentation of the mail. Many segmentation methods exist, commonly including dictionary-based methods, forward maximum matching, reverse maximum matching, bidirectional maximum matching, word-frequency statistics, minimum segmentation and statistics-based segmentation. The present invention uses the NLPIR segmentation tool to segment mail titles and bodies. NLPIR, also known as ICTCLAS, is a Chinese word segmentation system built by Dr. Zhang Huaping's team at the Institute of Computing Technology, Chinese Academy of Sciences. It is based on a multi-layer hidden Markov model, supports Chinese segmentation, English segmentation and mixed Chinese, English and digit segmentation, and supports part-of-speech tagging. Suppose a mail has the title 'The "New Year's Day greeting card" mailed by Zhang San'. Segmenting it with the NLPIR tool using the first-level part-of-speech tag set gives, glossed word by word, 'Zhang/q San/m mail/v come/v (particle)/u "/w New Year's Day/t greeting card/n "/w'. With a part-of-speech tag set in place, words that have little influence on the text content can easily be filtered out; the present invention keeps only verbs (/v), nouns (/n) and adjectives (/a).
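The part-of-speech filter can be illustrated with a short Python sketch. It does not call NLPIR; it assumes the segmentation output has already been parsed into (word, tag) pairs, and the function name and the English glosses of the example title are hypothetical.

```python
KEPT_TAGS = {"v", "n", "a"}  # verbs, nouns and adjectives


def keep_content_words(tagged_words):
    """Keep only the words whose first-level tag is verb, noun or adjective."""
    return [word for word, tag in tagged_words if tag in KEPT_TAGS]


# English glosses of the tagged example title, for illustration only.
tagged_title = [("Zhang", "q"), ("San", "m"), ("mail", "v"), ("come", "v"),
                ("(particle)", "u"), ('"', "w"), ("New Year's Day", "t"),
                ("greeting card", "n"), ('"', "w")]
print(keep_content_words(tagged_title))
```

Punctuation (/w), particles (/u) and time words (/t) are dropped, which already removes much of the vocabulary before feature selection.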
After the e-mail in the network audit system has been processed by the segmentation tool, the title and body of each mail become a sequence of individual words, and these words constitute the features of the mail. Classifying mail requires a unified feature dictionary to represent all mail; this dictionary is called the vocabulary (WordTable). If every kind of word contained in all the mail were added to the vocabulary, the result would usually be very large; depending on the size of the training set, it could well contain hundreds of thousands of words. An oversized vocabulary makes mail harder to represent and also increases the computational overhead of the system. To reduce the interference of words unrelated to mail features, the vocabulary has to be pruned. Pruning by part-of-speech tag shrinks the vocabulary considerably, but the vocabulary made up of the remaining word classes is still very large, so further processing is needed, namely feature selection (Feature Selection).
The present invention uses the χ² statistic for feature selection. The χ² statistic measures the degree to which a feature and a class depart from independence: when a feature Xi and a class Ck are strongly correlated (weakly independent), the χ² value is large; when they are strongly independent (weakly correlated), the χ² value is small. To select features with the χ² statistic, the χ² value of every feature is generally computed and the specified number of features with the largest values are selected.
The present invention has two mail classes, spam (Spam) and normal mail (Normal). Computing the χ² value of a feature Xi requires counting the number of times Xi occurs in each class, as shown in Table 1:
Table 1: Occurrence counts of feature Xi in the spam and normal mail classes
In Table 1, A, B, C, and D denote the occurrence counts of feature Xi in each class. Letting S denote the total number of mails (S = A + B + C + D), the χ² value of feature Xi is computed as shown in expression 1:
$$\chi^2(X_i) = \frac{S\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)} \qquad (1)$$
The χ² value of every feature is computed, and the k features with the largest values are selected. These k features are then used to vectorize each mail: for any mail, the k features are checked in turn, and if a feature occurs in the mail, the element at the corresponding position of the mail's vector is 1; otherwise it is 0.
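A minimal sketch of χ² feature selection and 0/1 vectorization follows. It assumes the standard 2×2 form of the statistic and a particular cell layout for A, B, C, D (stated in the docstring), since Table 1 is not reproduced in this text; the function names and the toy mails are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 contingency table.

    Assumed layout (an assumption, not taken from Table 1):
      a: spam mails containing the feature
      b: normal mails containing the feature
      c: spam mails not containing the feature
      d: normal mails not containing the feature
    """
    s = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so there is no evidence either way
    return s * (a * d - b * c) ** 2 / denom


def select_features(mails, k):
    """mails: list of (set_of_words, is_spam). Return the k highest-scoring words."""
    n_spam = sum(1 for _, is_spam in mails if is_spam)
    n_normal = len(mails) - n_spam
    vocab = set().union(*(words for words, _ in mails)) if mails else set()
    scores = {}
    for w in vocab:
        a = sum(1 for words, is_spam in mails if is_spam and w in words)
        b = sum(1 for words, is_spam in mails if not is_spam and w in words)
        scores[w] = chi_square(a, b, n_spam - a, n_normal - b)
    # Highest score first; ties broken alphabetically so the result is stable.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def vectorize(words, features):
    """Binary vector: 1 where the feature occurs in the mail, else 0."""
    return [1 if f in words else 0 for f in features]


mails = [
    ({"free", "win"}, True),
    ({"free", "offer"}, True),
    ({"meeting", "report"}, False),
    ({"meeting", "free"}, False),
]
features = select_features(mails, 1)
print(features, vectorize({"meeting", "today"}, features))
```

In this toy sample the word that appears in every normal mail and in no spam separates the classes perfectly, so it receives the largest χ² value.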
In the user relationship graph, each node represents a contact account, and a directed edge between two nodes represents a mail sent between the contacts. To reduce the storage space of the user relationship graph, all contact accounts are passed through a hash mapping (idMap). A key in this mapping is a user account and the corresponding value is the serial number assigned to that account; each key-value pair in idMap thus records the serial number of one user account.
The system maintains a contact account counter whose default initial value is 0; whenever a new contact account appears, the counter is incremented by 1.
The present invention builds the user relationship graph from normal mails, with nodes connected by directed edges, and stores the graph in a mapping structure called graphMap. Key-value pairs in graphMap have the form <Integer, Set<Integer>>. Each integer key is a sender serial number, which is the value mapped to some user account in idMap; the user serial numbers in the Set are the recipients corresponding to that key. A graphMap of this form fully represents the content of the user relationship graph, and the hash mapping structure speeds up node lookups in the graph.
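As a rough illustration of the two structures, the sketch below models idMap as a dictionary from account to serial number and graphMap as a dictionary from sender serial number to a set of recipient serial numbers. The accounts, serial numbers, and the `has_sent` helper are made up for the example.

```python
from typing import Dict, Set

# idMap: account -> serial number; graphMap: sender serial -> recipient serials.
# All accounts and numbers here are illustrative.
id_map: Dict[str, int] = {
    "alice@example.com": 1,
    "bob@example.com": 2,
    "carol@example.com": 3,
}
graph_map: Dict[int, Set[int]] = {1: {2, 3}, 2: {1}}


def has_sent(sender: str, recipient: str) -> bool:
    """True if graphMap holds a directed edge sender -> recipient."""
    s, r = id_map.get(sender), id_map.get(recipient)
    if s is None or r is None:
        return False  # unknown account: no node, hence no edge
    return r in graph_map.get(s, set())


print(has_sent("alice@example.com", "carol@example.com"))
print(has_sent("carol@example.com", "alice@example.com"))
```

Both lookups are hash lookups, which is the speed advantage the text attributes to the mapping structure; storing small integers instead of full account strings in the sets is what saves space.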
As shown in Fig. 4, the mail classification method of the present invention, which combines user relationships and Bayesian theory, comprises the following steps:
(1) Obtain mail samples and train a naive Bayes classifier on them, where the training samples are divided into normal mail and spam, and build the user relationship graph graphMap from the contact relationships in the normal mails of the training samples;
For example, suppose user Zhang San has sent mail to both Li Si and Wang Wu, and the identifiers of Zhang San, Li Si, and Wang Wu are 10, 20, and 30 respectively. Then the mapping idMap is <Zhang San, 10>, <Li Si, 20>, <Wang Wu, 30>, and the user relationship graph graphMap is <10, <20, 30>>;
The user relationship graph graphMap is built as shown in Fig. 2, through the following sub-steps:
(1-1) Read one mail from the mail samples and determine whether it is a normal mail; if so, go to step (1-2), otherwise go to step (1-9);
(1-2) Determine whether the sender of the mail exists in the mapping idMap; if not, go to step (1-3), otherwise go to step (1-4);
(1-3) Add the sender's account to idMap and update the id counter, that is, increment its value by one, then go to step (1-4);
(1-4) Determine whether the recipient of the mail exists in idMap; if not, go to step (1-5), otherwise go to step (1-6);
(1-5) Add the recipient's account to idMap, update the id counter, add the recipient's identifier to the recipient list set Set, and go to step (1-6);
(1-6) Determine whether the sender's identifier is in the user relationship graph graphMap; if not, go to step (1-7), otherwise go to step (1-8);
(1-7) Add the sender id and the recipient list set Set to graphMap as a key-value pair, then go to step (1-9);
(1-8) Update the value corresponding to the sender with the recipient list set Set, and go to step (1-9);
(1-9) Determine whether there are other mails in the mail samples; if so, return to step (1-1), otherwise the process ends.
(2) Extract the user white list from the user relationship graph graphMap built in step (1), where the user white list is initially empty. As shown in Fig. 3, this step comprises the following sub-steps:
(2-1) Read one key-value pair from graphMap and add the recipient identifiers in its recipient list set Set to the user white list;
(2-2) Determine whether graphMap still contains unread key-value pairs; if so, go to step (2-1), otherwise the process ends.
(3) Classify new mail using the user relationship graph built and the naive Bayes classifier trained in step (1) together with the user white list extracted in step (2), and update the user relationship graph and the user white list according to the result. This step comprises the following sub-steps:
(3-1) Read a new mail and determine whether its sender is in the user white list; if so, the mail is a normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) Add the relationship between the sender and the recipient to graphMap, add the recipient to the user white list, and go to step (3-9);
(3-3) Classify the mail with the naive Bayes classifier and determine whether the result is spam; if so, go to step (3-4), otherwise go to step (3-6). Specifically, a cost-sensitive factor λ is used to separate spam from normal mail. The decision value of the naive Bayes classifier is denoted cost, the ratio of the probability that the mail is spam to the probability that it is normal mail. If cost > λ, the mail is judged to be spam; otherwise it is judged to be normal mail. In the present invention λ is 1;
(3-4) Use the confidence factor δ to determine whether the mail is high-confidence spam; if so, go to step (3-5), otherwise go to step (3-9). Specifically, δ is a parameter set much larger than the cost-sensitive factor λ. If cost > δ, the classification result has very high confidence and the mail is judged to be high-confidence spam. In the present invention δ is 100;
(3-5) Delete the sender of the mail from graphMap, then go to step (3-9);
(3-6) Use the naive Bayes classifier and the confidence factor to determine whether the mail is a high-confidence normal mail; if so, go to step (3-7), otherwise go to step (3-8). Specifically, if cost < 1/δ, the mail is judged to be a high-confidence normal mail;
(3-7) Add the recipient's identifier to the user white list, then go to step (3-8);
(3-8) Add the relationship between the sender and the recipient of the mail to graphMap;
(3-9) Determine whether there is more new mail to classify; if so, return to step (3-1), otherwise end.
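Steps (1) to (3) can be sketched end to end as below. This is a simplified reading, not the patented implementation: every recipient of a normal mail is added to the sender's set (the sub-steps as written only mention recipients that are new to idMap), a new account is assumed to receive the counter value after incrementing, deleting a sender is read as removing both its key and its occurrences in other recipient sets, and `cost_fn` is a hypothetical stand-in for the trained naive Bayes classifier that returns P(spam | mail) / P(normal | mail).

```python
LAMBDA = 1.0    # cost-sensitive factor (value given in the text)
DELTA = 100.0   # confidence factor (value given in the text)


def build_graph(samples):
    """Step (1): samples is an iterable of (sender, recipients, is_normal)."""
    id_map, graph_map, counter = {}, {}, 0
    for sender, recipients, is_normal in samples:
        if not is_normal:
            continue  # only normal mail contributes edges
        ids = []
        for account in [sender, *recipients]:
            if account not in id_map:
                counter += 1              # assumption: new id = incremented counter
                id_map[account] = counter
            ids.append(id_map[account])
        graph_map.setdefault(ids[0], set()).update(ids[1:])
    return id_map, graph_map


def extract_whitelist(graph_map):
    """Step (2): every recipient found in graphMap goes on the white list."""
    whitelist = set()
    for recipient_ids in graph_map.values():
        whitelist |= recipient_ids
    return whitelist


def process_mail(sender, recipients, cost_fn, graph_map, whitelist):
    """Step (3) for one mail; sender and recipients are serial numbers.

    Returns "normal" or "spam" and updates graph_map and whitelist in place.
    """
    if sender in whitelist:                                    # (3-1), (3-2)
        graph_map.setdefault(sender, set()).update(recipients)
        whitelist.update(recipients)
        return "normal"
    cost = cost_fn()                                           # (3-3)
    if cost > LAMBDA:
        if cost > DELTA:                                       # (3-4), (3-5)
            graph_map.pop(sender, None)
            for recipient_ids in graph_map.values():
                recipient_ids.discard(sender)
        return "spam"
    if cost < 1 / DELTA:                                       # (3-6), (3-7)
        whitelist.update(recipients)
    graph_map.setdefault(sender, set()).update(recipients)     # (3-8)
    return "normal"


samples = [("zhangsan", ["lisi", "wangwu"], True), ("bulk", ["lisi"], False)]
id_map, graph_map = build_graph(samples)
whitelist = extract_whitelist(graph_map)
# Sender 2 is on the white list, so the classifier is never consulted.
result = process_mail(2, [1], lambda: 0.0, graph_map, whitelist)
print(id_map, graph_map, whitelist, result)
```

Passing the classifier as a callable keeps the white-list shortcut cheap: the probability ratio is only computed for senders that are not already trusted.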
To prevent abnormal users from appearing in the user white list, the users in the white list can be spot-checked. When a mail is classified and its sender is in the user white list, then with a small probability p the mail is not directly judged to be normal but is instead passed to the Bayes classifier. If the classifier judges it to be normal mail, the user is a normal user; if the classifier judges it to be spam, the user is deleted from the user white list. This prevents spam senders from slipping into the user white list and affecting the system's classification results.
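The spot check can be sketched as follows. The text does not give a value for p, so the default of 0.05 is purely an assumption, as are the function name and the `cost_fn` callable (standing in for the Bayes classifier's spam-to-normal probability ratio).

```python
import random


def check_whitelisted(sender, cost_fn, whitelist, p=0.05, lam=1.0, rng=random):
    """Classify a mail whose sender is on the white list.

    With probability p the mail is audited by the classifier instead of being
    trusted outright; a spam verdict removes the sender from the white list.
    """
    if rng.random() >= p:
        return "normal"            # usual path: trust the white list
    if cost_fn() > lam:            # audit path: ask the classifier
        whitelist.discard(sender)
        return "spam"
    return "normal"


trusted = {1, 2}
# p=1.0 forces an audit so the example is deterministic.
verdict = check_whitelisted(1, lambda: 50.0, trusted, p=1.0)
print(verdict, trusted)
```

A small p keeps the extra classification cost low while still giving every white-listed sender a chance of being re-examined over time.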
As those skilled in the art will readily understand, the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. A mail classification method combining user relationships and Bayesian theory, characterized by comprising the following steps:
(1) obtaining mail samples and training a naive Bayes classifier on the mail samples, wherein the training samples are divided into normal mail and spam, and building a user relationship graph graphMap from the contact relationships in the normal mails of the training samples;
(2) extracting a user white list from the user relationship graph graphMap built in step (1), wherein the user white list is initially empty;
(3) classifying new mail using the user relationship graph built and the naive Bayes classifier trained in step (1) together with the user white list extracted in step (2), and updating the user relationship graph and the user white list according to the result.
2. The mail classification method according to claim 1, characterized in that building the user relationship graph graphMap comprises the following sub-steps:
(1-1) reading one mail from the mail samples and determining whether it is a normal mail; if so, going to step (1-2), otherwise going to step (1-9);
(1-2) determining whether the sender of the mail exists in the mapping idMap; if not, going to step (1-3), otherwise going to step (1-4);
(1-3) adding the sender's account to idMap and updating the id counter, then going to step (1-4);
(1-4) determining whether the recipient of the mail exists in idMap; if not, going to step (1-5), otherwise going to step (1-6);
(1-5) adding the recipient's account to idMap, updating the id counter, adding the recipient's identifier ID to the recipient list set Set, and going to step (1-6);
(1-6) determining whether there are other recipients; if so, going to step (1-4); otherwise, determining whether the sender's identifier ID is in the user relationship graph graphMap; if not, going to step (1-7), otherwise going to step (1-8);
(1-7) adding the sender's identifier ID and the recipient list set Set to graphMap as a key-value pair, then going to step (1-9);
(1-8) updating the entry for the sender's identifier ID with the recipient list set Set, and going to step (1-7);
(1-9) determining whether there are other mails in the mail samples; if so, returning to step (1-1), otherwise ending the process.
3. The mail classification method according to claim 2, characterized in that step (2) comprises the following sub-steps:
(2-1) reading one key-value pair from the user relationship graph graphMap and adding the recipient identifiers ID in its recipient list set Set to the user white list;
(2-2) determining whether graphMap still contains unread key-value pairs; if so, going to step (2-1), otherwise ending the process.
4. The mail classification method according to claim 3, characterized in that step (3) comprises the following sub-steps:
(3-1) reading a new mail and determining whether its sender is in the user white list; if so, the mail is a normal mail and the process goes to step (3-2), otherwise it goes to step (3-3);
(3-2) adding the relationship between the sender and the recipient to the user relationship graph graphMap, adding the recipient to the user white list, and going to step (3-9);
(3-3) classifying the mail with the naive Bayes classifier and determining whether the result is spam; if so, going to step (3-4), otherwise going to step (3-6);
(3-4) using the confidence factor δ to determine whether the mail is high-confidence spam; if so, going to step (3-5), otherwise going to step (3-9);
(3-5) deleting the sender of the mail from graphMap, then going to step (3-9);
(3-6) using the naive Bayes classifier and the confidence factor to determine whether the mail is a high-confidence normal mail; if so, going to step (3-7), otherwise going to step (3-8);
(3-7) adding the recipient's identifier ID to the user white list, then going to step (3-8);
(3-8) adding the relationship between the sender and the recipient of the mail to graphMap, and adding the recipient to the user white list;
(3-9) determining whether there is more new mail to classify; if so, returning to step (3-1), otherwise ending.
5. The mail classification method according to claim 4, characterized in that step (3-3) is specifically: if cost, the ratio of the probability that the mail is spam to the probability that it is normal mail, is greater than the cost-sensitive factor λ, the mail is judged to be spam; otherwise it is judged to be normal mail.
6. The mail classification method according to claim 5, characterized in that step (3-4) is specifically: if cost > δ, the classification result has very high confidence and the mail is judged to be high-confidence spam, wherein δ is a parameter much larger than the cost-sensitive factor λ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510779256.1A CN106713108B (en) | 2015-11-13 | 2015-11-13 | A kind of process for sorting mailings of combination customer relationship and bayesian theory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106713108A CN106713108A (en) | 2017-05-24 |
CN106713108B true CN106713108B (en) | 2019-08-13 |
Family
ID=58930716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510779256.1A Active CN106713108B (en) | 2015-11-13 | 2015-11-13 | A kind of process for sorting mailings of combination customer relationship and bayesian theory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106713108B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113344060B (en) * | 2021-05-31 | 2022-07-08 | 哈尔滨工业大学 | Text classification model training method, litigation state classification method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101674264A (en) * | 2009-10-20 | 2010-03-17 | 哈尔滨工程大学 | Spam detection device and method based on user relationship mining and credit evaluation |
CN102404249A (en) * | 2011-11-18 | 2012-04-04 | 北京语言大学 | Method and device for filtering junk emails based on coordinated training |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7590694B2 (en) * | 2004-01-16 | 2009-09-15 | Gozoom.Com, Inc. | System for determining degrees of similarity in email message information |
Non-Patent Citations (1)
Title |
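---|
"基于用户关系挖掘和信誉评价的垃圾邮件识别算法" (Spam identification algorithm based on user relationship mining and reputation evaluation); Wang Wei (王巍); Wanfang (万方); 20121126; full text |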
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||