CN103136266A - Method and device for classification of mail - Google Patents

Publication number
CN103136266A
Authority
CN
China
Prior art keywords
mail
entry
classification
text
sorted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103924898A
Other languages
Chinese (zh)
Inventor
王艳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN2011103924898A
Publication of CN103136266A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for classifying mail. The method comprises: performing text segmentation on the mail to be classified to obtain an entry set; matching the entries in the entry set against the feature words in a feature dictionary that characterize mail categories, and calculating, according to the matching result, the conditional probability that the mail belongs to a category; and determining the category of the mail according to the conditional probability. The invention addresses the problem that prior-art mail classification methods are few and inaccurate. It thereby classifies mail and filters spam efficiently and accurately, improving both system performance and user experience.

Description

Method and device for classification of mail
Technical field
The present invention relates to the communications field, and in particular to a method and a device for classifying mail.
Background
As communication methods have diversified, mail has come to play an important role in how people communicate. Mail is used ever more widely, and the kinds of mail keep increasing. It is well known that weaknesses in the current network security environment and in the SMTP protocol have produced large volumes of spam, and the situation has worsened over the past few years. Spam is undeniably the most troublesome problem in today's e-mail systems. It can cause enterprises and users heavy losses and, more seriously, its harm is no longer confined to the e-mail content itself but extends to network security: if these threats cannot be controlled effectively, the whole enterprise network is exposed to security attacks. Various anti-spam technologies have therefore emerged in recent years. Among them, filtering is a comparatively simple and direct technique for handling spam. It is mainly used in the mail receiving system to identify and process spam. The common, simple filtering techniques are black and white lists, keywords, and rules.
The black and white list technique distinguishes known spam senders from trusted senders by IP address or mail address. If the mail address or IP address is on the white list, the mail is considered legitimate; if it is on the blacklist, the mail is considered spam. This technique has a defect: the lists cannot contain all (or even a large share of) IP addresses, and spammers can easily send spam from different IP addresses.
The keyword technique estimates how likely the current mail is to be spam by means of a set of preset keywords. In general, keyword filtering requires building a series of keyword lists according to the characteristics of spam, and these lists must be updated continually.
Rule-based filtering forms rules from certain features (such as words, phrases, position, size, and attachments) and describes spam with these rules. For the filter to be effective, administrators must maintain a huge rule base.
In the prior art, mail classification methods are few and inaccurate. As spam keeps increasing, mail cannot be filtered efficiently and accurately. This greatly inconveniences everyday mail use, wastes a large amount of users' time on handling spam, and may pose a security threat to personal or enterprise networks.
Summary of the invention
The invention provides a method and a device for classifying mail, to solve at least the problem that prior-art mail classification methods are few and inaccurate, so that mail cannot be filtered efficiently and accurately as spam keeps increasing.
According to one aspect of the present invention, a method for classifying mail is provided, comprising: performing text segmentation on the mail to be classified to obtain an entry set; matching the entries in the entry set against the feature words in a feature dictionary that characterize mail categories, and calculating, according to the matching result, the conditional probability that the mail belongs to a category; and determining the category of the mail according to the conditional probability.
Preferably, before text segmentation is performed on the mail to be classified to obtain the entry set, the mail is preprocessed as follows: the mail to be classified is filtered by sender address and/or by keywords contained in the mail; the title and body of each mail that is not filtered out are extracted to form a text; and the text is denoised.
Preferably, before text segmentation is performed on the mail to be classified to obtain the entry set, the feature dictionary is built as follows: the entropy of each entry of the mails in a training set is calculated; the entries belonging to the same category are sorted by entropy; and the entries whose entropy is greater than a preset threshold are selected as the feature words that characterize that mail category in the feature dictionary.
Preferably, a Bayesian classification algorithm is used to calculate, according to the matching result, the conditional probability that the mail belongs to a category.
Preferably, matching the entries in the entry set against the feature words in the feature dictionary that characterize mail categories, and using the Bayesian classification algorithm to calculate the conditional probability according to the matching result, comprises: matching the entries in the entry set one by one against the feature words to obtain a matched entry set; calculating, one by one, a first conditional probability that each entry in the matched entry set belongs to its matched category, and calculating from the first conditional probabilities a second conditional probability that the matched entry set belongs to the category; and calculating, from the second conditional probability, the conditional probability that the mail belongs to the category.
Preferably, maximum matching is used to perform text segmentation on the mail to be classified to obtain the entry set.
Preferably, the categories of mail comprise spam and normal email.
According to another aspect of the present invention, a device for classifying mail is provided, comprising: a word segmentation module, configured to perform text segmentation on the mail to be classified to obtain an entry set; a matching module, configured to match the entries in the entry set against the feature words in a feature dictionary that characterize mail categories; a computing module, configured to calculate, according to the matching result, the conditional probability that the mail belongs to a category; and a determining module, configured to determine the category of the mail according to the conditional probability.
Preferably, the device further comprises: a filtering submodule, configured to filter the mail to be classified by sender address and/or by keywords contained in the mail; an extraction submodule, configured to extract the title and body of each mail that is not filtered out to form a text; and a denoising submodule, configured to denoise the text.
Preferably, the mail categories comprise spam and normal email.
In the present invention, the entries extracted from the mail to be classified are matched against the feature words of the feature dictionary, the conditional probability of the category to which the mail belongs is calculated from the matching result, and the category of the mail is determined from that conditional probability. This solves the prior-art problem that mail classification methods are few and inaccurate, so that mail is classified and spam is filtered efficiently and accurately, improving both system performance and user experience.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the mail classification method according to an embodiment of the invention;
Fig. 2 is a schematic flowchart of the mail classification method according to a preferred embodiment of the invention;
Fig. 3 is a flowchart of text segmentation in the mail classification method according to the preferred embodiment of the invention;
Fig. 4 is a flowchart of the mail classification method based on the Bayesian classification algorithm according to the preferred embodiment of the invention;
Fig. 5 is a first structural block diagram of the mail classification device according to an embodiment of the invention; and
Fig. 6 is a second structural block diagram of the mail classification device according to an embodiment of the invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features within them may be combined with one another.
Since the prior art offers no method that classifies mail both efficiently and accurately, the present invention provides a method for classifying mail. Fig. 1 is a flowchart of the mail classification method according to an embodiment of the invention. The method comprises:
Step S102: perform text segmentation on the mail to be classified to obtain an entry set;
Step S104: match the entries in the entry set against the feature words in the feature dictionary that characterize mail categories, and calculate, according to the matching result, the conditional probability that the mail belongs to a category;
Step S106: determine the category of the mail according to the conditional probability.
In this embodiment, the entries extracted from the mail to be classified are matched against the feature words of the feature dictionary, the conditional probability of the category to which the mail belongs is calculated, and the category of the mail is determined from that conditional probability. This solves the prior-art problem that mail classification methods are few and inaccurate, so that mail is classified and spam is filtered efficiently and accurately, improving both system performance and user experience.
Before step S102, the mail to be classified may also be preprocessed. The preprocessing may comprise: filtering the mail to be classified by sender address and/or by keywords contained in the mail; extracting the title and body of each mail that is not filtered out to form a text; and denoising the text.
In implementation, the mail to be classified may be filtered with the existing keyword technique, the sender address may be filtered with the blacklist technique, or the two techniques may be combined. When spam is handled, combining the two techniques filters the mail twice, which reduces the number of mails that need further processing and raises the quality of the mail that remains. The title and body of each mail left after this filtering are then extracted to form a text. Such a mail may have a title in text format and a body in picture format, or both title and body in text format. The text is then denoised, that is, phrases in the text that have been mutated are replaced with the phrases in a preset mapping table. The mapping table stores the correspondence between mutated text and normal text. For example, mutated text in which words are broken up by inserted symbols, such as "poli%tics" or "corr%@uption", can be recorded in the mapping table so that the mutated text can be converted back to normal text. In use, the mapping table must be updated continually so that its content grows richer and it can recognize the mutated text of different spam. Preprocessing the mail to be classified prepares it for classification, improves the quality of the mail, and improves system performance.
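The preprocessing described above can be sketched in Python as follows. This is only an illustrative sketch: the blacklist, the keyword list, the mapping-table contents, and all function names are hypothetical, since the source does not specify concrete data or interfaces.

```python
# Hypothetical data: the source gives no concrete lists or table format.
BLACKLIST = {"spammer@example.com"}
KEYWORDS = {"free prize"}
# Mapping table: mutated phrase -> normal phrase.
MUTATION_MAP = {"poli%tics": "politics", "corr%@uption": "corruption"}


def is_filtered(sender: str, text: str) -> bool:
    """Return True when the mail is dropped by the address or keyword filter."""
    if sender.lower() in BLACKLIST:
        return True
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def denoise(text: str) -> str:
    """Replace every mutated phrase that is listed in the mapping table."""
    # Longest keys first, so a longer mutated phrase wins over its substrings.
    for mutated in sorted(MUTATION_MAP, key=len, reverse=True):
        text = text.replace(mutated, MUTATION_MAP[mutated])
    return text


def preprocess(sender: str, title: str, body: str):
    """Join title and body into one text, filter it, then denoise it."""
    text = f"{title}\n{body}"
    if is_filtered(sender, text):
        return None  # the mail was filtered out and needs no classification
    return denoise(text)
```

Returning `None` for a filtered mail is a design choice of this sketch; it lets a caller skip segmentation and classification for mail that the simple filters have already rejected.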
Before text segmentation is performed on the mail to be classified to obtain the entry set, a feature dictionary may also be built. Building a feature dictionary that contains the feature words of the different categories may comprise the following processing:
(1) calculate the entropy of each entry of the mails in the training set;
(2) sort the entries belonging to the same category by entropy;
(3) select the entries whose entropy is greater than a preset threshold as the feature words that characterize that mail category in the feature dictionary.
In implementation, the user may select some specific mails from all mails received as the training set, or may use a training set already configured in the system. The categories of the mails in the training set are specified in advance. An entry extracted from a mail of a given category is assigned that category, so when the entry is selected as a feature word, the category of the feature word is already determined. The entropy of each extracted entry is calculated according to the entropy formula, and the entries belonging to the same category are sorted by entropy, in either descending or ascending order. The sorted entries are compared with the preset threshold (that is, a preset entropy value). When the entropy of an entry is greater than the preset threshold, the entry has some discriminating power and is added to the feature dictionary as a feature word. Determining feature words from the entropy of the entries makes the feature words of the system more useful.
In implementation, a Bayesian classification algorithm may be used to calculate, according to the matching result, the conditional probability that the mail belongs to a category. This may comprise the following processing: matching the entries in the entry set one by one against the feature words in the feature dictionary that characterize mail categories, to obtain a matched entry set; calculating, one by one, a first conditional probability that each entry in the matched entry set belongs to its matched category, and calculating from the first conditional probabilities a second conditional probability that the matched entry set belongs to the category; and calculating, from the second conditional probability, the conditional probability that the mail belongs to the category.
When text segmentation is needed, maximum matching may be used to segment the mail to be classified. That is, the length of the field to be matched is preset, and the extracted field is matched. For example, 4 Chinese characters may be taken as the field to be matched; if the field does not match, its last character is removed to form a new field to be matched.
When mail is classified, the categories may comprise spam and normal email, or mail may be divided by subject into categories such as the medical, chemical, and biological fields. Mail can be divided into different types according to the feature words of the different categories.
Preferred embodiment
This embodiment takes as an example the case in which the user selects some specific mails from all mails received as the training set, and explains the whole process. Fig. 2 is a schematic flowchart of the mail classification method according to the preferred embodiment of the invention. The embodiment provides a method for classifying spam that mainly comprises two phases: classification training and classification application. Training comes first: the given training set is preprocessed; then, with the support of a Chinese dictionary, the text of the training set is segmented by maximum matching; then the feature dictionary is built, with dimensionality reduced through feature extraction. Classification follows: a given mail to be classified is processed in the same way as the training mail and, after text segmentation, is matched using the minimum-risk Bayesian algorithm; the category of the mail is calculated and output. Each process is explained further below on the basis of this general idea.
Step S202: determine the types of the sample mails. Because this classification training is supervised, it must be known in advance which mails in the training set are normal email and which are spam, so the mails must be labeled manually.
Step S204: mail collection and preprocessing. An e-mail is a semi-structured text comprising a mail header and a body. The mail title is usually a summary of the body, and the body is the main content exchanged between sender and recipient. An e-mail is sent by the user through a mail client program to an SMTP server and is then forwarded to the destination mailbox; finally, the recipient retrieves the mail from the mailbox through a POP3 server program, using an account name and password.
The collection and preprocessing of mail may comprise: (1) Filtering out part of the mail by means of a "malicious address rule base", formulated by the user or learned by the system itself, using simple address filtering or address filtering combined with simple keyword matching. (2) Preprocessing the mail that is not filtered out, that is, removing the structural information that is useless for classification and extracting only the text formed by the title and body of the mail. (3) Denoising the extracted text. For example, some spam deforms words by inserting punctuation inside a phrase, or by splitting "Falun Gong" into variants such as "Fa Chelun Gong". For such mutated text, a mapping table must be created in advance, and the mutated text is converted into normal text through the mapping table.
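Item (2) above, discarding structural information and keeping only the title and body, can be illustrated with Python's standard `email` package. This is a minimal sketch rather than the procedure of the source: the function name `extract_text` is hypothetical, and the choice to keep only `text/plain` parts (so that picture-format bodies and attachments are skipped) is an assumption.

```python
from email import message_from_string
from email.message import Message


def extract_text(raw_mail: str) -> str:
    """Return the subject line followed by all text/plain parts of the mail."""
    msg: Message = message_from_string(raw_mail)
    parts = [msg.get("Subject", "")]
    for part in msg.walk():
        # Skip multipart containers and non-text parts such as images.
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    # Drop empty pieces and surrounding whitespace before joining.
    return "\n".join(p.strip() for p in parts if p.strip())
```

All other headers (sender, routing information, content type) are ignored here, which corresponds to removing the structural information that does not help classification.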
Step S206: text segmentation. Text segmentation cuts the text of a mail into meaningful Chinese entries with the support of a Chinese dictionary. Of all the entries obtained by segmenting the sample mail texts, a portion is kept in the feature dictionary after "feature selection", as the feature entries used to classify mail.
The Chinese dictionary is stored as 4 hash tables. A hash table may serve as a sub-table of the mapping table for comparison, or may be placed in the feature dictionary, so that feature words are looked up in the hash table when checking whether a feature word matches. The 4 hash tables store four-character, three-character, two-character, and single-character words respectively, and the position of an entry within its hash table is determined by the hash code of the entry. If entry s is a string of n Chinese characters s[0] s[1] ... s[n-1], the hash code of the entry has the polynomial form $\mathrm{Hash\_code} = s[0]\cdot b^{\,n-1} + s[1]\cdot b^{\,n-2} + \dots + s[n-1]$, where s[i] is the Unicode value of the character s[i] and b denotes the fixed base of the powers (its value does not survive in this text).
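The dictionary layout described above can be sketched in Python as follows. The sketch makes two loudly labeled assumptions that do not come from the source: the hash uses base 31 (the value used by Java's `String.hashCode`), and the code is reduced modulo 2^32 to keep it bounded. The names `hash_code` and `build_dictionary` are hypothetical.

```python
def hash_code(entry: str, base: int = 31, modulus: int = 2**32) -> int:
    """Polynomial hash s[0]*base^(n-1) + ... + s[n-1], reduced modulo `modulus`.

    base=31 and the modulus are assumptions of this sketch, not values
    taken from the source.
    """
    code = 0
    for ch in entry:
        # Horner's rule; ord() gives the Unicode code point of the character.
        code = (code * base + ord(ch)) % modulus
    return code


def build_dictionary(words):
    """Store the hash codes of words in four tables keyed by word length (1 to 4)."""
    tables = {length: set() for length in (1, 2, 3, 4)}
    for word in words:
        if len(word) in tables:  # words longer than 4 characters are ignored
            tables[len(word)].add(hash_code(word))
    return tables
```

Keeping one table per word length means a candidate field of length k only has to be looked up in the table for k-character words, which matches the way the segmentation step shortens the field one character at a time.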
The segmentation method adopted by the embodiment of the invention is maximum matching. Its steps are shown in Fig. 3 and proceed as follows:
Step S302: initialize the data. Determine the maximum length of the field to be matched.
Step S304: determine whether a matching entry exists for the field to be matched. If one exists, execute step S306; otherwise execute step S308. For example, take the first 4 characters of the current Chinese character sequence of the text as the field to be matched and look it up in the four-character hash table. If the table contains such an entry, continue with step S306; otherwise execute step S308.
Step S306: the match succeeds; output the result. The field to be matched is split off from the current Chinese character sequence as an entry and placed in the entry set.
Step S308: remove the last character of the field to be matched and match again. For example, remove the last character of the 4-character field to form a new field to be matched, then match it against the entries in the corresponding three-character hash table of the dictionary.
Step S310: determine whether the field matches. If it matches, continue with step S306; otherwise execute step S312.
Step S312: remove the last character of the field to be matched and match again, until a match is found.
Both the maximum entry length and the segmentation method affect the accuracy of text segmentation. The maximum entry length adopted by the invention is 4 characters. For example, "中华人民共和国" (People's Republic of China) is split into 2 entries, "中华人民" and "共和国". If the maximum segmentation length is too large, however, maximum matching repeats many futile attempts whenever the entries to be cut are short. A maximum entry length of 4 characters balances the accuracy and the real-time requirements of mail filtering.
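Steps S302 to S312 amount to forward maximum matching, which can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a plain Python `set` stands in for the four hash tables, the function name `max_match` is hypothetical, and emitting an unmatched single character as its own entry is an assumption, since the source does not say how characters outside the dictionary are handled.

```python
def max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Forward maximum matching: try the longest field first, then shorten it."""
    entries = []
    pos = 0
    while pos < len(text):
        # Start with up to max_len characters (step S302).
        length = min(max_len, len(text) - pos)
        # Drop the last character until the field matches (steps S308 to S312).
        while length > 1 and text[pos:pos + length] not in dictionary:
            length -= 1
        # Assumption: a single character is emitted even when it is not
        # in the dictionary, so that the scan always makes progress.
        entries.append(text[pos:pos + length])
        pos += length
    return entries
```

For the example in the text, segmenting "中华人民共和国" with a dictionary that contains "中华人民" and "共和国" yields those two entries, because the 4-character field matches at the first attempt and the remaining 3 characters match as a whole.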
Step S208: feature selection. After the texts of a number of sample mails have been segmented, the feature entries best suited to classification must be chosen from the entry set and placed in the feature dictionary. The category of each sample mail is specified in advance, and the category of an entry cut from a sample mail is the category of that sample mail. The same entry cut from sample mails of different categories will of course have several classes. Let $t_i(c_j)$ denote entry $t_i$ recorded in the entry set together with its class $c_j$. The invention selects feature entries with an entropy-based algorithm, which may comprise the following:
For all entries in the entry set $\{t_i(c_j)\}$, count one by one the probability $P(c_j \mid t_i)$ that entry $t_i$ belongs to class $c_j$. If the total number of entries in the entry set is $N$ and the number of occurrences of entry $t_i(c_j)$ is $n_{ij}$, then $P(c_j \mid t_i) = n_{ij}/N$.
Calculate one by one the entropy of each entry, $E(t_i(c_j)) = -\sum_{j=1}^{m} P(c_j \mid t_i)\,\log P(c_j \mid t_i)$, where $m$ is the number of categories of sample mail, here 2 (spam and normal email).
Sort all entries belonging to the same class in descending order of entropy, set a threshold $\lambda_j$ for class $c_j$, and put all entries with $E(t_i(c_j)) \ge \lambda_j$ into one hash table of the feature dictionary ($j = 1, 2, \dots, m$), which yields a feature dictionary made up of $m$ hash tables. The larger the entropy $E(t_i(c_j))$, the greater the influence of entry $t_i$ on mail classification. The threshold $\lambda_j$ can be set according to the number of entries belonging to class $c_j$: if there are few entries, the threshold can be lower, to ensure that the hash table used for classification holds enough feature words. The resulting feature dictionary is illustrated in Table 1 and Table 2; what is actually stored is the hash code of each entry.
Table 1 normal email dictionary
...
Computer technology
Course
Cooperation
...
Table 2 spam dictionary
...
The Political Bureau
Corrupt
Upheaval
...
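The selection procedure of step S208 can be sketched in Python as follows. The sketch follows the formulas as stated in the text, including the normalization of every count by the total entry count N (so the probabilities of a single entry need not sum to 1). The function name, the input layout as (entry, class) pairs, and the threshold values used by a caller are hypothetical.

```python
import math
from collections import Counter


def select_features(labelled_entries, thresholds):
    """labelled_entries: list of (entry, class) pairs; thresholds: class -> lambda_j."""
    total = len(labelled_entries)                 # N
    counts = Counter(labelled_entries)            # n_ij for each (entry, class)
    classes = sorted({c for _, c in labelled_entries})
    entropy = {}
    for entry in {t for t, _ in labelled_entries}:
        e = 0.0
        for c in classes:
            p = counts[(entry, c)] / total        # P(c_j | t_i) = n_ij / N
            if p > 0:
                e -= p * math.log(p)
        entropy[entry] = e
    lexicon = {c: [] for c in classes}
    for entry, c in counts:
        if entropy[entry] >= thresholds[c]:       # keep entries with E >= lambda_j
            lexicon[c].append(entry)
    for c in classes:
        # Each class table is ordered by descending entropy.
        lexicon[c].sort(key=lambda t: entropy[t], reverse=True)
    return lexicon, entropy
```

The function returns one list per class in place of the m hash tables, plus the entropy values, so that a caller can inspect how a given threshold affects the size of each table.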
Step S210: classify the mail to be classified. After any mail to be classified has gone through the preprocessing and text segmentation of steps S204 and S206, the entry set of the mail text is obtained. The entries in this set are matched one by one against the feature words in the feature dictionary, and the Bayesian classification algorithm is then used to obtain the category of the mail. The classification process is shown in Fig. 4 and proceeds as follows:
Step S402: let the entry set of the mail text to be classified be $T = \{t_1, t_2, \dots, t_k, \dots, t_n\}$.
Step S404: take entries from the entry set $T$ one by one in order and match them against the feature words in the hash tables of the feature dictionary, and determine whether the entry set $T$ is empty. If so, execute step S406; otherwise execute step S408.
Step S406: if $t_k$ matches $t_i(c_j)$, assign class $c_j$ to $t_k$, denoted $t_k(c_j)$. Execute step S410.
Step S408: if $t_k$ fails to match every feature word, take the next entry. Execute step S404 until $T$ is empty, which gives the classified entry set $\{t_k(c_j)\}$ ($k = 1, 2, \dots, n$; $j = 1, 2, \dots, m$).
Step S410: calculate one by one the class-conditional probability of each entry, $P(t_k \mid c_j) = P(c_j \mid t_k)\,P(t_k)/P(c_j)$. Here, if the total number of entries in $\{t_k(c_j)\}$ is $N$ and the number of entries belonging to class $c_j$ is $n_j$, then $P(c_j) = n_j/N$; if entry $t_k$ occurs $n_k$ times in $\{t_k(c_j)\}$, then $P(t_k) = n_k/N$; and if $t_k(c_j)$ occurs $n_{kj}$ times, then $P(c_j \mid t_k) = n_{kj}/N$.
Step S412: calculate the class-conditional probability of the text, $P(T \mid c_j) = P(t_1 \mid c_j)\,P(t_2 \mid c_j) \cdots P(t_n \mid c_j) = \prod_{k=1}^{n} P(t_k \mid c_j)$.
Step S414: calculate the probability that text $T$ belongs to a given class, $P(c_j \mid T) = P(c_j)\,P(T \mid c_j)/P(T)$, where $P(T) = \sum_{j=1}^{m} P(c_j)\,P(T \mid c_j)$.
Step S416: the class that attains $\max\{P(c_1 \mid T), P(c_2 \mid T), \dots, P(c_m \mid T)\}$ is taken as the category of mail text $T$.
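Steps S412 to S416 can be sketched in Python as follows. This is a minimal sketch, not the source's implementation: it takes the priors P(c_j) and the per-entry probabilities P(t_k | c_j) as given inputs instead of estimating them from counts, it assumes every stored probability is strictly positive, and it works in log space to avoid numeric underflow on long mails. The names `posterior` and `classify` and the dictionary layout are hypothetical.

```python
import math


def posterior(entries, priors, cond):
    """Return P(c_j | T) for every class, given P(c_j) and P(t_k | c_j).

    priors: class -> P(c_j); cond: class -> {entry: P(t_k | c_j)}.
    """
    # Entries that match no feature word are skipped, as in step S408.
    matched = [t for t in entries if all(t in cond[c] for c in priors)]
    log_joint = {}
    for c, prior in priors.items():
        # log P(c_j) + sum over k of log P(t_k | c_j)   (step S412)
        log_joint[c] = math.log(prior) + sum(math.log(cond[c][t]) for t in matched)
    # P(T) is the sum of the joint terms over all classes (step S414).
    peak = max(log_joint.values())
    weights = {c: math.exp(v - peak) for c, v in log_joint.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def classify(entries, priors, cond):
    """Pick the class with the largest posterior (step S416)."""
    post = posterior(entries, priors, cond)
    return max(post, key=post.get)
```

Subtracting the largest log value before exponentiating leaves the normalized result unchanged, because the common factor cancels when dividing by the sum.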
It should be pointed out that if $t_k$ fails to match every feature word, then $P(t_k \mid c_j) = 0$, which raises two problems. First, the probability is underestimated. Second, the factors in step S412 are multiplied together, so if any factor $P(t_k \mid c_j) = 0$, the whole product is 0. To solve this, the invention uses the m-estimate, given by $P(t_k \mid c_j) = (n_{kj} + m\,p)/(n_j + m)$, where $n_{kj}$ and $n_j$ are as defined above, $p$ is a prior estimate of the probability to be determined, and $m$ is a constant called the equivalent sample size (not to be confused with the number of categories $m$ used above). When no background knowledge about the prior $p$ is available, a typical approach is to assume a uniform prior: if there are $k$ categories, take $p = 1/k$. Because mail only needs to be divided into two classes, normal email and spam, the invention adopts minimum-risk Bayesian classification. Using the minimum-risk Bayesian decision first requires defining a decision table, as shown in Table 3.
Table 3
Actual state of the mail | Decision | Decision loss
Spam | Spam | 0
Spam | Normal email | 1
Normal email | Spam | x
Normal email | Normal email | 0
The loss of misjudging spam as normal email is set to 1, and the loss of misjudging normal email as spam is $x$. Because misjudging normal email as spam causes the greater loss, $x > 1$. The class values of normal email and spam are $c_j = 0$ and $c_j = 1$ respectively.
The conditional risk of classifying mail $T$ as spam is: $R(\mathrm{spam} \mid T) = 0 \cdot P(c_1 \mid T) + x \cdot P(c_0 \mid T) = x\,(1 - P(c_1 \mid T))$.
The conditional risk of classifying it as normal email is: $R(\mathrm{legitimate} \mid T) = 1 \cdot P(c_1 \mid T) + 0 \cdot P(c_0 \mid T) = P(c_1 \mid T)$.
For $T$ to be classified as spam, $R(\mathrm{spam} \mid T) < R(\mathrm{legitimate} \mid T)$ must hold, that is, $x\,(1 - P(c_1 \mid T)) < P(c_1 \mid T)$. Rearranging gives $x < (x + 1)\,P(c_1 \mid T)$, so $P(c_1 \mid T) > x/(x + 1)$. When this condition is met, the risk of classifying the mail as spam is smaller than the risk of classifying it as normal email. Experiments show that $x = 9$ gives a fairly satisfactory classification result (for $x = 9$ the threshold is 0.9).
Therefore, in the actual classification process, $P(c_1 \mid T)$ can be calculated according to the minimum-risk Bayesian formulation, substituting the formula $P(t_k \mid c_j) = (n_{kj} + m\,p)/(n_j + m)$ (where $p$ is the value calculated by the Bayesian formula), to obtain the classification result for the mail.
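The m-estimate and the minimum-risk decision can be sketched together in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the posterior P(c_1 | T) for spam is assumed to have been computed already, and the default x = 9 is the value the text reports as working well.

```python
def m_estimate(n_kj: int, n_j: int, p: float, m: float) -> float:
    """Smoothed P(t_k | c_j) = (n_kj + m*p) / (n_j + m); never 0 when m*p > 0."""
    return (n_kj + m * p) / (n_j + m)


def conditional_risks(p_spam: float, x: float):
    """Return (R(spam|T), R(legitimate|T)) for the decision table with losses 1 and x."""
    risk_spam = x * (1.0 - p_spam)   # loss x when the mail is actually normal
    risk_legit = 1.0 * p_spam        # loss 1 when the mail is actually spam
    return risk_spam, risk_legit


def is_spam(p_spam: float, x: float = 9.0) -> bool:
    """Decide spam only when that decision carries the strictly smaller risk."""
    risk_spam, risk_legit = conditional_risks(p_spam, x)
    return risk_spam < risk_legit
```

With x = 9, a posterior of 0.95 gives the spam decision (risk 0.45 against 0.95), while a posterior of 0.85 does not (risk 1.35 against 0.85). Comparing the two risks directly, instead of hard-coding a probability threshold, keeps the code valid if the loss table is changed.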
According to a further aspect in the invention, the embodiment of the present invention provides a kind of device of classification of mail, and as shown in Figure 5, this device comprises: word-dividing mode 10 is used for mail to be sorted is carried out the text participle to obtain an entry collection; Matching module 20, the Feature Words that is used for the entry that entry is concentrated and feature dictionary sign mail classes is complementary; Computing module 30 is used for calculating according to matching result the conditional probability that mail belongs to classification; Determination module 40 is for determine the classification of mail according to conditional probability.Word-dividing mode 10, matching module 20, computing module 30 and determination module 40 connect successively or are coupled.
As shown in Figure 6, the device of above-mentioned classification of mail can also comprise: filtering module 50, and the keyword that is used for comprising by sender address and/or mail filters mail to be sorted; Extraction module 60 be used for to extract the title of the mail that does not filter out and text to form text; Denoising module 70 is used for text is carried out denoising.Filtering module 50, extraction module 60, denoising module 70 connect successively or are coupled, and denoising module 70 is connected or is coupled with word-dividing mode 10.
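The way the modules of Fig. 5 and Fig. 6 are chained can be sketched in Python as follows. This is only an illustration of the data flow: the class name, the field names, and the choice to model each module as a plain callable are assumptions, and the source does not prescribe any software interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class MailClassifier:
    """Chains the modules of Fig. 5 and Fig. 6; every module is a plain callable."""
    preprocess: Callable[[str], Optional[str]]       # modules 50, 60, 70; None means filtered out
    segment: Callable[[str], List[str]]              # word segmentation module 10
    lexicon: Set[str]                                # feature words used by matching module 20
    score: Callable[[List[str]], Dict[str, float]]   # computing module 30

    def classify(self, raw_text: str) -> Optional[str]:
        text = self.preprocess(raw_text)
        if text is None:
            return None                              # mail rejected before classification
        entries = self.segment(text)
        matched = [t for t in entries if t in self.lexicon]   # matching module 20
        probabilities = self.score(matched)
        return max(probabilities, key=probabilities.get)      # determining module 40
```

Passing the modules in as callables mirrors the statement that the modules are connected or coupled in sequence, and lets each stage be replaced independently, for example by a different segmentation method.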
Here, the mail categories may comprise spam and normal mail.
In addition, the mail classification device may further comprise the feature dictionary. The feature dictionary may be built as follows: calculate the entropy of the entries of the mails in the training set; sort the entries belonging to the same category by entropy; select the entries whose entropy is greater than a preset threshold as the feature words that characterize that mail category in the feature dictionary.
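The sort-and-threshold part of building the feature dictionary can be sketched as follows. This Python example assumes the entropy of every entry has already been computed for each category and is passed in as a plain mapping; how the entropy itself is calculated is not restated in this passage, so the sketch does not attempt it. The function name and the sample scores are hypothetical.

```python
from typing import Dict, List


def build_feature_dictionary(
    entropy_by_category: Dict[str, Dict[str, float]],
    threshold: float,
) -> Dict[str, List[str]]:
    """Select feature words per category from precomputed entropy values.

    entropy_by_category maps category -> {entry: entropy}.
    Entries of one category are sorted by entropy (highest first),
    and only those strictly above the threshold are kept.
    """
    features: Dict[str, List[str]] = {}
    for category, scores in entropy_by_category.items():
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        features[category] = [entry for entry, value in ranked if value > threshold]
    return features


if __name__ == "__main__":
    sample = {
        "spam": {"free": 0.9, "meeting": 0.1, "winner": 0.7},
        "normal": {"meeting": 0.8, "free": 0.05},
    }
    print(build_feature_dictionary(sample, threshold=0.5))
```

Sorting before thresholding keeps the strongest feature words first, which is convenient if the dictionary is later truncated to a fixed size.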
The matching module 20 of the above device may further be configured to match the entries in the entry set one by one against the feature words that characterize mail categories in the feature dictionary, to obtain a matched entry set. The calculation module 30 may further be configured to calculate, one by one, a first conditional probability that each entry in the matched entry set belongs to the matched category; to calculate, from the first conditional probabilities, a second conditional probability that the matched entry set belongs to the category; and to calculate, from the second conditional probability, the conditional probability that the mail belongs to the category.
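The three-stage calculation described above can be illustrated with a small naive Bayes sketch in Python. It assumes that entries are conditionally independent given the category, so the probability of the matched entry set is the product of the per-entry probabilities, and that category priors are available. Both assumptions, and all names and numbers below, are illustrative rather than taken from the patent.

```python
import math
from typing import Dict, Iterable


def category_posteriors(
    entries: Iterable[str],
    entry_probs: Dict[str, Dict[str, float]],
    priors: Dict[str, float],
) -> Dict[str, float]:
    """Return P(category | mail) for every category.

    entry_probs maps feature word -> {category: P(word | category)};
    every probability must be positive (for example, smoothed).
    """
    # Matching: keep only entries that are feature words.
    matched = [entry for entry in entries if entry in entry_probs]
    log_joint: Dict[str, float] = {}
    for category, prior in priors.items():
        # First probabilities: P(entry | category) for each matched entry.
        # Second probability: their product, accumulated in log space.
        log_likelihood = sum(math.log(entry_probs[e][category]) for e in matched)
        log_joint[category] = math.log(prior) + log_likelihood
    # Final probability: normalise over categories (Bayes' rule).
    peak = max(log_joint.values())
    unnormalised = {c: math.exp(v - peak) for c, v in log_joint.items()}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}


if __name__ == "__main__":
    probs = {
        "free": {"spam": 0.8, "normal": 0.1},
        "meeting": {"spam": 0.1, "normal": 0.6},
    }
    print(category_posteriors(["free", "hello"], probs, {"spam": 0.5, "normal": 0.5}))
```

Working in log space avoids numeric underflow when a mail matches many feature words.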
As can be seen from the above description, the present invention achieves the following technical effects:
The embodiments of the present invention use a minimum-risk Bayesian algorithm to perform text classification. Mail is classified automatically by recognizing its content, and spam is then filtered out according to the category; mail can also be forwarded securely according to its category. This improves accuracy while reducing the risk of judging normal mail to be spam. At the same time, the mail preprocessing operation greatly improves the recall rate of spam.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be performed in an order different from the one given here. Alternatively, they can each be made into individual integrated circuit modules, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. the method for a classification of mail, is characterized in that, comprising:
Mail to be sorted is carried out the text participle to obtain an entry collection;
The Feature Words that characterizes mail classes in the entry that described entry is concentrated and feature dictionary is complementary, and calculates according to described matching result the conditional probability that described mail belongs to described classification;
Determine the classification of described mail according to described conditional probability.
2. The method according to claim 1, characterized in that, before the text word segmentation is performed on the mail to be classified to obtain the entry set, the mail to be classified is preprocessed as follows:
Filtering the mail to be classified by the sender address and/or keywords contained in the mail;
Extracting the title and body of the mails that are not filtered out to form a text;
Denoising the text.
3. method according to claim 1 and 2, is characterized in that, mail to be sorted being carried out the text participle with before obtaining an entry collection, sets up by the following method described feature dictionary, comprising:
Calculation training is concentrated the entropy of the entry of mail;
The described entry that will belong to identical category sorts by described entropy;
Choose described entropy greater than the entry of predetermined threshold value as characterizing the Feature Words of this mail classes in described feature dictionary.
4. The method according to claim 1, characterized in that a Bayesian classification algorithm is used to calculate, according to the matching result, the conditional probability that the mail belongs to the category.
5. The method according to claim 4, characterized in that matching the entries in the entry set against the feature words that characterize mail categories in the feature dictionary, and using the Bayesian classification algorithm to calculate, according to the matching result, the conditional probability that the mail belongs to the category, comprises:
Matching the entries in the entry set one by one against the feature words that characterize mail categories in the feature dictionary, to obtain a matched entry set;
Calculating, one by one, a first conditional probability that each entry in the matched entry set belongs to the matched category, and calculating, from the first conditional probabilities, a second conditional probability that the matched entry set belongs to the category;
Calculating, from the second conditional probability, the conditional probability that the mail belongs to the category.
6. The method according to claim 1, characterized in that a maximum matching method is used to perform the text word segmentation on the mail to be classified to obtain the entry set.
7. The method according to claim 1, characterized in that the categories of the mail comprise: spam and normal mail.
8. A mail classification device, characterized by comprising:
A word segmentation module, configured to perform text word segmentation on a mail to be classified to obtain an entry set;
A matching module, configured to match the entries in the entry set against feature words that characterize mail categories in a feature dictionary;
A calculation module, configured to calculate, according to the matching result, the conditional probability that the mail belongs to the category;
A determination module, configured to determine the category of the mail according to the conditional probability.
9. The device according to claim 8, characterized in that the mail classification device further comprises:
A filtering module, configured to filter the mail to be classified by the sender address and/or keywords contained in the mail;
An extraction module, configured to extract the title and body of the mails that are not filtered out to form a text;
A denoising module, configured to denoise the text.
10. The device according to claim 8, characterized in that the mail categories comprise: spam and normal mail.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103924898A CN103136266A (en) 2011-12-01 2011-12-01 Method and device for classification of mail

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103924898A CN103136266A (en) 2011-12-01 2011-12-01 Method and device for classification of mail

Publications (1)

Publication Number Publication Date
CN103136266A true CN103136266A (en) 2013-06-05

Family

ID=48496098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103924898A Pending CN103136266A (en) 2011-12-01 2011-12-01 Method and device for classification of mail

Country Status (1)

Country Link
CN (1) CN103136266A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130605