CN101119341A - Mail identifying method and apparatus - Google Patents


Publication number
CN101119341A
Authority
CN
China
Prior art keywords
mail
center
probability
spam
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101546412A
Other languages
Chinese (zh)
Other versions
CN101119341B (en
Inventor
王晖
林初仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN2007101546412A priority Critical patent/CN101119341B/en
Publication of CN101119341A publication Critical patent/CN101119341A/en
Application granted granted Critical
Publication of CN101119341B publication Critical patent/CN101119341B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention discloses a mail identification method and device. The method comprises: obtaining the external feature values of a mail; determining the type of the mail according to those external feature values; and sending the mail to the Bayesian filter corresponding to that type to identify whether the mail is spam. The invention also provides a mail identification device. Because the type of the mail is first determined from its external feature values and the mail is then sent to the Bayesian filter corresponding to that type, the accuracy of mail identification is improved.

Description

Mail identifying method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a mail identification method and device.
Background art
Spam is mail that is forcibly delivered to a user's mailbox without the user's permission. To maintain the stability and security of user mailboxes, a variety of mail identification technologies are used in mail systems, for example the Bayesian filter.
A Bayesian filter is a filter that uses a Bayesian algorithm to identify mail. It first performs word segmentation on the mail content and/or the mail subject to obtain a word segmentation result. Then, according to a sample library, it obtains for each word in the word segmentation result the probability that the mail is spam given that word. Finally, it substitutes these per-word probabilities into the Bayesian formula to calculate the spam probability of the mail; if the spam probability is greater than a predetermined threshold, the mail is marked as spam. The threshold can be set according to the specific requirements of the mail system. For example, if the threshold is set to 0.9, a mail whose spam probability is greater than 0.9 is marked as spam.
The sample library of a Bayesian filter is obtained by learning from a certain number of spam and non-spam mails. In the prior art, referring to Fig. 1, the same Bayesian filter 101 is used to identify mail for multiple mailbox users; that is, the mail of multiple users is identified with the same sample library. Identifying mail for multiple users with one sample library can recognize typical spam. However, some mails are spam for user A but non-spam for user B. With a single sample library, Bayesian filter 101 can only identify such mails either as spam or as non-spam. If Bayesian filter 101 identifies them as spam, the result is incorrect for user B, and vice versa.
Therefore, using one Bayesian filter to identify the mail of multiple users erases the differences between users, so that the accuracy of spam identification is low and cannot satisfy user requirements.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a mail identification method and device that can improve the accuracy of spam identification.
To solve the above technical problem, the embodiments provided by the present invention are implemented through the following technical solutions:
An embodiment of the invention provides a mail identification method, comprising: obtaining the external feature values of a mail; determining the type of the mail according to the external feature values of the mail; and sending the mail to the Bayesian filter corresponding to that type to identify whether the mail is spam.
Preferably, the above method further comprises:
Obtaining the external feature values of sample mails;
According to the external feature values, using the k-means algorithm to select n center mails from the sample mails, each center mail corresponding to one class of sample mails;
Training a Bayesian filter with the class of sample mails corresponding to each center mail as its sample library;
Wherein the sample mails comprise mails marked as spam and mails marked as non-spam.
Preferably, in the above method, determining the type of the mail according to its external feature values specifically comprises:
Calculating, according to the external feature values, the distance L from the mail to each of the n predetermined center mails;
Sorting the distances from the mail to the n predetermined center mails, and selecting the center mail with the shortest distance to the mail.
Preferably, in the above method, determining the type of the mail according to its external feature values specifically comprises:
Calculating, according to the external feature values, the distance L from the mail to each of the n predetermined center mails;
Sorting the distances from the mail to the n predetermined center mails, and selecting i center mails in ascending order of distance, i being an integer greater than or equal to 2.
Preferably, in the above method, sending the mail to the Bayesian filter corresponding to the type to identify whether the mail is spam specifically comprises:
Calculating, according to the external feature values, the distance L from the mail to each of the n predetermined center mails;
Calculating, according to the distances L from the mail to the selected i center mails, the distance probabilities $Q_1, Q_2, \ldots, Q_i$ of the mail with respect to the selected center mails, where $Q_1 + Q_2 + \cdots + Q_i = 1$;
Sending the mail to the Bayesian filters corresponding to the selected i center mails to obtain the spam probabilities $P_1, P_2, \ldots, P_i$ of the mail;
Calculating the weighted spam probability of the mail according to: weighted spam probability $= P_1 Q_1 + P_2 Q_2 + \cdots + P_i Q_i$; comparing the weighted spam probability with a preset second threshold; and, if it is higher than the second threshold, marking the mail as spam.
Preferably, in the above method, determining the type of the mail according to its external feature values specifically comprises:
Calculating, according to the external feature values, the distance L from the mail to each of the n predetermined center mails;
Calculating, according to the distances L from the mail to the n predetermined center mails, the distance probabilities $Q_1, Q_2, \ldots, Q_n$ of the mail with respect to the n predetermined center mails, where $Q_1 + Q_2 + \cdots + Q_n = 1$;
Sorting the distance probabilities, adding them one by one in descending order, and selecting the center mails whose distance probabilities make the running sum exceed a predetermined distance probability threshold for the first time.
Preferably, in the above method, sending the mail to the Bayesian filter corresponding to the type to identify whether the mail is spam specifically comprises:
Sending the mail to the Bayesian filters corresponding to the selected center mails to obtain the spam probabilities $P_1, P_2, \ldots, P_j$ of the mail, j being the number of selected center mails;
Calculating the weighted spam probability of the mail according to: weighted spam probability $= P_1 Q_1 + P_2 Q_2 + \cdots + P_j Q_j$; comparing the weighted spam probability with a predetermined third threshold; and, if it is greater than the third threshold, marking the mail as spam.
Preferably, in the above method, calculating the distance probabilities of the mail with respect to the predetermined center mails according to the distances L from the mail to those center mails specifically comprises:
Calculating the reciprocal of the distance L from the mail to each center mail, and dividing each reciprocal by the sum of the reciprocals of the distances L from the mail to all of the center mails.
Preferably, in the above method, if the user changes the identification result of a mail, the method further comprises:
Performing word segmentation on the content and/or subject of the mail, and retraining the Bayesian filter.
An embodiment of the invention also provides a mail identification device, comprising:
A feature extraction unit, configured to obtain the external feature values of a mail;
A type judging unit, configured to determine the type of the mail according to its external feature values;
A mail distribution unit, configured to send the mail to the Bayesian filter corresponding to the type;
A Bayesian filter, configured to calculate the spam probability P of the mail;
A mail identification unit, configured to judge, according to the spam probability P, whether the mail is spam and, if so, to mark the mail as spam.
Preferably, the above device further comprises:
A center mail selection unit, configured to obtain the external feature values of sample mails and, according to the external feature values, to use the k-means algorithm to select the n center mails from the sample mails, each center mail corresponding to one class of sample mails;
A Bayesian filter training unit, configured to train a Bayesian filter with the class of sample mails corresponding to each center mail as its sample library;
Wherein the sample mails comprise mails marked as spam and mails marked as non-spam.
Preferably, in the above device, the type judging unit specifically comprises a first sorting unit and a first selection unit;
The first sorting unit is configured to calculate, according to the external feature values, the distance L from the mail to each of the n predetermined center mails, and to sort the distances from the mail to the n predetermined center mails;
The first selection unit is configured to select the center mail with the shortest distance to the mail.
Preferably, in the above device, the type judging unit specifically comprises a second sorting unit and a second selection unit;
The second sorting unit is configured to calculate, according to the external feature values, the distance L from the mail to each of the n predetermined center mails, and to sort the distances from the mail to the n predetermined center mails;
The second selection unit is configured to select at least two center mails in ascending order of distance.
Preferably, in the above device, the mail distribution unit is specifically configured to send the mail to the Bayesian filters corresponding to the selected i center mails to obtain the spam probabilities $P_1, P_2, \ldots, P_i$ of the mail;
The mail identification unit specifically comprises a first distance probability calculation unit and a first spam identification unit;
The first distance probability calculation unit is configured to calculate, according to the distances L from the mail to the selected i center mails and a preset distance probability calculation rule, the distance probabilities $Q_1, Q_2, \ldots, Q_i$ of the mail with respect to the selected center mails, where $Q_1 + Q_2 + \cdots + Q_i = 1$;
The first spam identification unit is configured to calculate the weighted spam probability of the mail according to: weighted spam probability $= P_1 Q_1 + P_2 Q_2 + \cdots + P_i Q_i$; to compare the weighted spam probability with a preset second threshold; and, if it is higher than the second threshold, to mark the mail as spam.
Preferably, in the above device, the type judging unit specifically comprises a second distance probability calculation unit and a third selection unit;
The second distance probability calculation unit is configured to calculate, according to the external feature values, the distance L from the mail to each of the n predetermined center mails and, according to these distances and a predetermined distance probability calculation rule, to calculate the distance probabilities $Q_1, Q_2, \ldots, Q_n$ of the mail with respect to the n predetermined center mails, where $Q_1 + Q_2 + \cdots + Q_n = 1$;
The third selection unit is configured to sort the distance probabilities, add them one by one in descending order, and select the center mails whose distance probabilities make the running sum exceed a predetermined distance probability threshold for the first time.
Preferably, in the above device, the mail identification unit is specifically configured to calculate the weighted spam probability of the mail according to: weighted spam probability $= P_1 Q_1 + P_2 Q_2 + \cdots + P_j Q_j$; to compare the weighted spam probability with a predetermined threshold; and, if it is greater than the predetermined threshold, to mark the mail as spam, j being the number of selected center mails.
The above technical solutions have the following beneficial effects:
In the technical solutions provided by the embodiments of the invention, the type of a mail is determined according to its external feature values, and the mail is sent to the Bayesian filter corresponding to that type to identify whether it is spam. Because the class of the mail is first determined from its external feature values and the mail is then sent to the Bayesian filter corresponding to that type, the accuracy of mail identification is improved.
Brief description of the drawings
Fig. 1 is a logical schematic diagram of the mail identification scheme provided by the prior art;
Fig. 2 is a logical schematic diagram of the mail identification scheme provided by the embodiments of the invention;
Fig. 3 is a flow chart of the mail identification method provided by the first embodiment of the invention;
Fig. 4 is a flow chart of the mail identification method provided by the second embodiment of the invention;
Fig. 5 is a flow chart of the mail identification method provided by the third embodiment of the invention;
Fig. 6 is a flow chart of the center mail selection method provided by the embodiments of the invention;
Fig. 7 is a schematic structural diagram of the mail identification device provided by the fourth embodiment of the invention;
Fig. 8 is a schematic structural diagram of the mail identification device provided by the fifth embodiment of the invention;
Fig. 9 is a schematic structural diagram of the mail identification device provided by the sixth embodiment of the invention;
Fig. 10 is a schematic structural diagram of the mail identification device provided by the seventh embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the embodiments are described in detail below with reference to the accompanying drawings.
Referring to Fig. 2, a logical schematic diagram of the mail identification scheme provided by the embodiments of the invention, the basic idea of the scheme is as follows: first, the external feature values of a mail received by the mail server are obtained; next, the type of the mail is determined; finally, the mail is sent to the Bayesian filter corresponding to that type of mail to identify whether it is spam. The number of Bayesian filters equals the number of mail types.
Referring to Fig. 3, a flow chart of the mail identification method provided by the first embodiment of the invention, the method comprises:
Step 301: obtain the external feature values of the mail and, according to them, calculate the distance L from the mail to each of the n predetermined center mails;
In the embodiments of the invention, the external feature values of a mail are features that do not involve the mail subject or content, such as the mail length and the mail encoding.
Suppose the external feature values obtained are the length x and the mail encoding y. The distance between two mails is then $L = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$, where $x_1$, $x_2$ are the lengths of the two mails and $y_1$, $y_2$ are their mail encodings;
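As a minimal sketch of this distance, assume each mail has already been reduced to a tuple of numeric external feature values. How a mail encoding is mapped to a number, and whether the features are rescaled first, is not specified in the text and is left to the caller here. The name `mail_distance` is hypothetical.

```python
import math


def mail_distance(features_a, features_b):
    """Euclidean distance L between two mails' external feature tuples."""
    if len(features_a) != len(features_b):
        raise ValueError("both mails need the same number of features")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(features_a, features_b)))


# Hypothetical feature tuples: (length x, numeric encoding id y).
mail_1 = (3.0, 4.0)
mail_2 = (0.0, 0.0)
print(mail_distance(mail_1, mail_2))  # 5.0
```

Because the length is typically a much larger number than an encoding id, an implementation would probably normalise the features before computing the distance; that choice is an assumption outside the text.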
Step 302: sort the distances from the mail to the n predetermined center mails and, following the shortest-distance principle, select the center mail closest to the mail;
Step 303: send the mail to the Bayesian filter corresponding to the center mail selected in step 302, and calculate the spam probability of the mail;
Step 304: compare the spam probability with a predetermined first threshold; if it is higher than the first threshold, mark the mail as spam.
In the embodiments of the invention, each center mail corresponds to one mail type and to one Bayesian filter. For example, if mails are classified by length, the first center mail may correspond to the class of mails whose length is 100 bit and the second center mail to the class of mails whose length is 10000 bit; if a mail is closest to the second center mail, the mail belongs to the class of mails whose length is 10000 bit.
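Steps 301 to 304 can be sketched roughly as follows. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, each Bayesian filter is stood in for by a plain callable that returns a spam probability, and the centers are given as feature tuples.

```python
import math


def nearest_center(features, centers):
    """Steps 301-302: index of the center mail closest to `features`."""
    return min(range(len(centers)), key=lambda k: math.dist(features, centers[k]))


def identify_by_nearest_center(mail_text, features, centers, filters, first_threshold):
    """Steps 303-304: ask the nearest center's filter and apply the threshold."""
    k = nearest_center(features, centers)
    spam_probability = filters[k](mail_text)
    return spam_probability > first_threshold


# Hypothetical setup: one feature (length), two center mails, two stub filters.
centers = [(100.0,), (10000.0,)]
filters = [lambda text: 0.20, lambda text: 0.95]
print(identify_by_nearest_center("hello", (9000.0,), centers, filters, 0.9))  # True
```

A mail of length 9000 is closer to the second center, so only the second stub filter is consulted.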
Referring to Fig. 4, a flow chart of the mail identification method provided by the second embodiment of the invention, the method comprises:
Step 401: obtain the external feature values of the mail and, according to them, calculate the distance L from the mail to each of the n predetermined center mails;
Step 402: sort the distances from the mail to the n predetermined center mails and select i center mails in ascending order of distance;
In a specific implementation, the technician can preset i to any integer greater than or equal to 2 according to system requirements.
Step 403: according to the distances L from the mail to the selected i center mails and a predetermined distance probability calculation rule, calculate the distance probabilities $Q_1, Q_2, \ldots, Q_i$ of the mail with respect to the selected center mails, where $Q_1 + Q_2 + \cdots + Q_i = 1$;
In the embodiments of the invention, the predetermined distance probability calculation rule can be:
$Q_k = \dfrac{1/L_k}{1/L_1 + 1/L_2 + \cdots + 1/L_i}$, where $k = 1, 2, \ldots, i$. In other embodiments of the invention, other formulas can also be used to calculate the distance probabilities without affecting the implementation of the embodiments.
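The reciprocal-distance rule can be sketched as below. The function name is hypothetical, and rejecting a zero distance is an assumption of this sketch: the text does not say what to do when a mail coincides with a center mail, where the reciprocal is undefined.

```python
def distance_probabilities(distances):
    """Q_k = (1 / L_k) / sum_m (1 / L_m) for the given distances L_1..L_i."""
    if any(d <= 0 for d in distances):
        # Assumption: zero or negative distances are rejected, not special-cased.
        raise ValueError("distances must be positive")
    inverses = [1.0 / d for d in distances]
    total = sum(inverses)
    return [inv / total for inv in inverses]


q = distance_probabilities([2, 3])
print([round(value, 2) for value in q])  # [0.6, 0.4]
```

A closer center mail receives a larger share, and the shares always sum to 1.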
Step 404: send the mail to the Bayesian filters corresponding to the selected i center mails to obtain the spam probabilities $P_1, P_2, \ldots, P_i$ of the mail;
Step 405: calculate the weighted spam probability of the mail according to: weighted spam probability $= P_1 Q_1 + P_2 Q_2 + \cdots + P_i Q_i$; compare the weighted spam probability with a predetermined second threshold; if it is higher than the second threshold, mark the mail as spam.
Because a mail may not belong entirely to a single class of mail, the second embodiment sorts the distances to select several mail types for the mail and sends the mail to the Bayesian filters corresponding to the selected center mails for identification, which improves the accuracy of mail identification.
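A rough sketch of steps 401 to 405 follows. The function name is hypothetical, the Bayesian filters are stood in for by callables that return a spam probability, and the sketch assumes no distance is exactly zero.

```python
import math


def weighted_spam_probability(mail_text, features, centers, filters, i=2):
    """Weighted spam probability over the i nearest center mails."""
    # Steps 401-402: rank center mails by distance and keep the i nearest.
    ranked = sorted((math.dist(features, c), k) for k, c in enumerate(centers))[:i]
    # Step 403: reciprocal-distance rule (assumes every distance is > 0).
    inverses = [1.0 / distance for distance, _ in ranked]
    total = sum(inverses)
    # Steps 404-405: sum of P_k * Q_k over the selected center mails.
    return sum(
        filters[k](mail_text) * inverse / total
        for (_, k), inverse in zip(ranked, inverses)
    )


# Hypothetical data: distances from the mail to the three centers are 2, 3 and 7.
centers = [(0.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
filters = [lambda text: 1.0, lambda text: 0.5, lambda text: 0.0]
p = weighted_spam_probability("hello", (2.0, 0.0), centers, filters, i=2)
print(p > 0.7)  # compare with a hypothetical second threshold of 0.7
```

With i = 2 the third filter is never called; the two nearest filters are weighted 0.6 and 0.4.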
Referring to Fig. 5, a flow chart of the mail identification method provided by the third embodiment of the invention, the method comprises:
Step 501: obtain the external feature values of the mail and, according to them, calculate the distance L from the mail to each of the n predetermined center mails;
Step 502: according to the distances L from the mail to the n predetermined center mails and a predetermined distance probability calculation rule, calculate the distance probabilities $Q_1, Q_2, \ldots, Q_n$ of the mail with respect to the n predetermined center mails, where $Q_1 + Q_2 + \cdots + Q_n = 1$;
Step 503: sort the distance probabilities $Q_1, Q_2, \ldots, Q_n$, add them one by one in descending order, and select the center mails whose distance probabilities make the running sum exceed the predetermined distance probability threshold for the first time;
For example, if the predetermined distance probability threshold is 90% and, adding the distance probabilities in descending order, the sum of the three largest is the first to exceed 90%, the three center mails corresponding to these distance probabilities are selected.
Step 504: send the mail to the Bayesian filters corresponding to the center mails selected in step 503 to obtain the spam probabilities $P_1, P_2, \ldots, P_j$ of the mail, j being the number of center mails selected in step 503;
Step 505: calculate the weighted spam probability of the mail according to: weighted spam probability $= P_1 Q_1 + P_2 Q_2 + \cdots + P_j Q_j$; compare the weighted spam probability with a predetermined third threshold; if it is higher than the third threshold, mark the mail as spam.
In the third embodiment, the mail types for the mail are selected by accumulating distance probabilities, and the mail is sent to the Bayesian filters corresponding to the selected center mails for identification, which improves the accuracy of mail identification.
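The selection in step 503 and the weighting in step 505 can be sketched as follows. Both function names are hypothetical, and the spam probabilities are passed in as plain numbers instead of being produced by real filters.

```python
def select_by_cumulative_probability(distances, probability_threshold):
    """Steps 502-503: return [(center index, Q_k)] in descending order of Q_k,
    stopping once the running sum first exceeds the threshold."""
    inverses = [1.0 / d for d in distances]  # assumes every distance is > 0
    total = sum(inverses)
    ranked = sorted(((inv / total, k) for k, inv in enumerate(inverses)), reverse=True)
    selected, running_sum = [], 0.0
    for q, k in ranked:
        selected.append((k, q))
        running_sum += q
        if running_sum > probability_threshold:
            break
    return selected


def weighted_probability(selected, spam_probabilities):
    """Step 505: sum of P_k * Q_k over the selected center mails."""
    return sum(q * spam_probabilities[k] for k, q in selected)


# Hypothetical distances to four center mails, threshold 90%.
chosen = select_by_cumulative_probability([1.0, 2.0, 4.0, 8.0], 0.9)
print([k for k, _ in chosen])  # [0, 1, 2]
```

Note that when fewer than n center mails are selected, the weights used in step 505 sum to less than 1. The text does not say whether they should be renormalised, so this sketch uses them as computed in step 502.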
In a specific implementation, the technician can preset the thresholds according to the requirements of different mail systems; each threshold is a number greater than 0 and less than or equal to 1.
The flow of the mail identification method provided by the embodiments of the invention has been described above; its implementation is described below with a concrete example.
Taking 3 predetermined center mails as an example, the specific implementation of the first, second and third embodiments of the invention is described.
Suppose the 3 predetermined center mails are a1, a2 and a3, and the distances from mail b to these 3 center mails are 2, 3 and 4 respectively.
First embodiment of the invention: the three distances are sorted. Because the distance from mail b to center mail a1 is the shortest, mail b is sent to the Bayesian filter corresponding to mail a1, the spam probability of the mail is calculated, and whether the mail is spam is identified.
Second embodiment of the invention: if it is preset that 2 center mails are selected in ascending order of distance, then after the three distances are sorted, center mails a1 and a2 are selected, and mail b is sent to the Bayesian filter corresponding to center mail a1 and to the Bayesian filter corresponding to center mail a2, yielding two spam probabilities $P_1$, $P_2$. From the distances from mail b to mails a1 and a2, the distance probability of mail b with respect to a1 is $Q_1 = \frac{1/2}{1/2 + 1/3} = 0.6$ and with respect to a2 is $Q_2 = \frac{1/3}{1/2 + 1/3} = 0.4$. The weighted spam probability of the mail is $0.6 P_1 + 0.4 P_2$; this weighted spam probability is compared with the predetermined second threshold to identify whether the mail is spam.
Third embodiment of the invention: suppose the preset distance probability threshold is 80%. The distance probability of mail b with respect to mail a1 is $Q_1 = \frac{1/2}{1/2 + 1/3 + 1/4} \approx 0.46$; in the same way, the distance probability with respect to mail a2 is $Q_2 \approx 0.31$ and with respect to mail a3 is $Q_3 \approx 0.23$. After these three distance probabilities are sorted and added in descending order, the running sum first exceeds 80% only when all of a1, a2 and a3 are included ($0.46 + 0.31 = 0.77$ is still below the threshold). Therefore, mail b is sent to the Bayesian filters corresponding to center mails a1, a2 and a3, yielding three spam probabilities $P_1$, $P_2$, $P_3$. The weighted spam probability of the mail is $0.46 P_1 + 0.31 P_2 + 0.23 P_3$; this weighted spam probability is compared with the predetermined third threshold to identify whether the mail is spam.
In addition, in the embodiments of the invention, after a mail is marked as spam, the handling of the spam can further be chosen according to its spam probability or weighted spam probability. For example, when the spam probability or weighted spam probability is greater than a certain threshold, the spam is discarded directly, and spam below this threshold is delivered to the user's spam folder.
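The two-level handling described here can be sketched as a small decision function. The function name, the returned labels and the example threshold values are all hypothetical; the text only states that spam above some threshold is discarded and the rest goes to the spam folder.

```python
def choose_handling(probability, spam_threshold, discard_threshold):
    """Pick a handling for a mail from its (weighted) spam probability.

    Assumes spam_threshold <= discard_threshold; both values are chosen
    by the operator of the mail system.
    """
    if probability <= spam_threshold:
        return "inbox"        # not marked as spam
    if probability > discard_threshold:
        return "discard"      # very likely spam: drop without delivery
    return "spam_folder"      # marked as spam but still delivered


print(choose_handling(0.95, spam_threshold=0.9, discard_threshold=0.99))  # spam_folder
```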
The mail identification method provided by the embodiments of the invention has been described above with a concrete example. Because the method requires n center mails to be set in advance as the benchmarks for determining the mail type, the method of selecting the n center mails provided by the embodiments of the invention is described below.
In the embodiments of the invention, the k-means algorithm is used to select the n center mails. Referring to Fig. 6, the method comprises:
Step 601: randomly select n mails from the m sample mails as center mails, n being an integer greater than or equal to 2;
The sample mails comprise mails identified as spam and mails identified as non-spam. The number of randomly selected mails determines how many classes the mails are divided into. For example, if 1000 mails are randomly drawn from 100,000 sample mails, the 100,000 mails will be divided into 1000 classes.
Step 602: obtain the external feature values of the sample mails;
Step 603: according to the external feature values, calculate the distance from each of the remaining m-n sample mails to each of the n center mails;
Step 604: for each mail, compare its distances to the n center mails and, following the shortest-distance principle, place the mail in the same class as the center mail closest to it; repeat step 604 until all sample mails are classified;
Step 605: within each class, add up the external feature values of all mails feature by feature and divide by the number of mails in the class to obtain the mean external feature values of the class;
Step 606: compare the external feature values of each mail in a class with the mean external feature values of that class; the mail with the smallest difference is the new center mail of the class; repeat step 606 until n new center mails are obtained;
Using the n new center mails selected in step 606, steps 602 to 604 are executed again to reclassify the m-n sample mails; step 606 is then executed to calculate the center mail of each reclassified class. This process is repeated until the center mail calculated for each class is the same mail over several consecutive rounds, at which point these center mails are chosen as the benchmarks for determining the mail type.
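The selection procedure of steps 601 to 606 can be sketched as below. This is a simplified illustration under assumptions: the function name and parameters are hypothetical, sample mails are given as tuples of numeric external feature values, and the loop stops the first time the set of center mails repeats (or after `max_rounds`), whereas the text waits for the centers to stay the same over several consecutive rounds.

```python
import math
import random


def select_center_mails(samples, n, max_rounds=100, seed=0):
    """Return {center index: [member indices]} for n classes of sample mails."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(samples)), n)       # step 601
    dim = len(samples[0])
    classes = {}
    for _ in range(max_rounds):
        # Steps 603-604: each center stays in its own class; every other
        # sample joins the class of its nearest center mail.
        assigned = {c: [c] for c in centers}
        for idx, features in enumerate(samples):
            if idx in assigned:
                continue
            nearest = min(centers, key=lambda c: math.dist(features, samples[c]))
            assigned[nearest].append(idx)
        # Steps 605-606: the member closest to the class mean becomes the
        # new center mail of that class.
        updated = {}
        for members in assigned.values():
            mean = [sum(samples[m][d] for m in members) / len(members) for d in range(dim)]
            new_center = min(members, key=lambda m: math.dist(samples[m], mean))
            updated[new_center] = members
        converged = set(updated) == set(centers)
        centers, classes = list(updated), updated
        if converged:
            break
    return classes


# Hypothetical samples: two well-separated groups of (length, encoding id).
samples = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (101.0, 0.0), (102.0, 0.0), (103.0, 0.0)]
for center, members in sorted(select_center_mails(samples, 2).items()):
    print(center, sorted(members))
```

Unlike textbook k-means, the center here is always an actual sample mail (the member nearest to the class mean), which matches the description in step 606.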
A simple example illustrates the above process of selecting center mails.
Suppose there are 10 sample mails and the external feature values obtained are the length value and the mail encoding value. Two mails, mail a and mail f, are picked at random, the distances from the other 8 mails to mail a and mail f are calculated, and the mails are classified for the first time. Suppose the result is that the first class comprises the five mails a, b, c, d, e and the second class comprises the five mails f, g, h, i, j. The lengths of mails a, b, c, d, e are then added and the sum is divided by 5 to obtain the mean length of the first class; the mean mail encoding of the first class is calculated in the same way. The external feature values of the five mails in the first class are compared with these means; if the external feature values of mail c are closest to the means, mail c is the new center mail. In the same way, the new center mail of the second class is found to be j. With mail c and mail j as the new center mails, the distances from the other 8 mails to mail c and mail j are calculated and the mails are classified again. Suppose the result is that the first class comprises c, a, d, f, h and the second class comprises b, e, g, i, j. The means of the external feature values of each class are then calculated again, the center mail of each reclassified class is obtained, the mails are classified again according to the new center mails, and so on. This flow is repeated until the center mail calculated for each class no longer changes over several consecutive rounds. For example, if for 5 consecutive rounds the center mail of the first class is mail c and the center mail of the second class is mail f, mail c and mail f are selected as the benchmarks for determining the mail type; the mails in the same class as mail c are used as the sample mails for training the Bayesian filter corresponding to the first class, and the mails in the same class as mail f are used as the sample mails for training the Bayesian filter corresponding to the second class.
In other embodiments of the invention, the c-means algorithm may also be used to select the center mails from the sample mails; this does not affect the implementation of the embodiments of the invention.
As the above process of selecting center mails shows, in the embodiments of the invention mails are classified by their external manifestation characteristic values.
In the embodiments of the invention, each center mail corresponds to one Bayesian filter; that is, each class of mail has its own Bayesian filter. The process of training a Bayesian filter is described in detail below.
As described above, after a center mail is selected, the mails grouped into one class with that center mail are used as the sample mails for training the Bayesian filter corresponding to that center mail.
Taking the example above, suppose the mails grouped into one class with mail c are a, b, d, h. The Bayesian filter is trained on these sample mails as follows:
The subjects and/or contents of mails a, b, c, d, h are segmented into words and a sample library is built in which each record corresponds to one word. A record contains: the word length, the word, the number of spam mails containing the word (BadCount), and the number of normal mails containing the word (GoodCount). The sample library also stores the number of spam mails (BadEmailCount) and the number of normal mails (GoodEmailCount) among these five sample mails.
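Building the sample library can be sketched as follows. This is only an illustration: the `tokenize` helper is a hypothetical whitespace splitter standing in for a real word segmenter, the dictionary layout is an assumption, and the word-length field of each record is omitted because it can be derived from the word itself.

```python
from collections import defaultdict

def tokenize(text):
    """Hypothetical stand-in for word segmentation: distinct lowercase words."""
    return set(text.lower().split())

def train(samples):
    """Build a sample library from (text, is_spam) pairs of one mail class."""
    library = {
        "words": defaultdict(lambda: {"BadCount": 0, "GoodCount": 0}),
        "BadEmailCount": 0,   # number of spam sample mails
        "GoodEmailCount": 0,  # number of normal sample mails
    }
    for text, is_spam in samples:
        library["BadEmailCount" if is_spam else "GoodEmailCount"] += 1
        for word in tokenize(text):
            # Count mails containing the word, once per mail.
            library["words"][word]["BadCount" if is_spam else "GoodCount"] += 1
    return library

# Hypothetical labelled sample mails of one class
library = train([("cheap pills", True),
                 ("meeting notes", False),
                 ("cheap meeting", False)])
```

One such library would be built per center mail, from the sample mails of that class only.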
Thus, in the embodiments of the invention, a Bayesian filter is trained separately for each class of mail, so the number of Bayesian filters is determined by the number of mail classes.
The process by which a Bayesian filter calculates the spam probability is described below. The process comprises:
Word segmentation is used to divide the subject and/or content of the mail into several words. For each of these words, the number of times it occurs in spam (BadCount) and the number of times it occurs in non-spam (GoodCount) are looked up in the sample library, and the number of spam samples (BadEmailCount) and the number of normal mail samples (GoodEmailCount) in the sample library are obtained.
Suppose event A is: the mail is spam, and t1, t2, ..., tn are the words obtained by segmenting the mail.
P(A|ti) denotes the probability that the mail is spam when the word ti occurs in it.
P(A|ti) may also be called the spam probability of the word ti. Clearly,
P(A|ti) = (BadCount/BadEmailCount) / ((GoodCount/GoodEmailCount) + (BadCount/BadEmailCount))
Suppose the probabilities that the mail is spam when the words t1, t2, ..., tr occur are P1, P2, ..., Pr respectively.
P(A|t1, t2, t3, ..., tr) denotes the probability that the mail is spam when the words t1, t2, ..., tr occur in it simultaneously.
According to the Bayesian formula:
P(A|t1, t2, t3, ..., tr) = (P1*P2*...*Pr) / [P1*P2*...*Pr + (1-P1)*(1-P2)*...*(1-Pr)], the spam probability of the mail is calculated.
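The two formulas above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and it assumes that every word passed in was seen in the sample library and that both mail counts are nonzero, since the source does not say how zero counts are handled.

```python
def word_spam_probability(bad_count, good_count,
                          bad_email_count, good_email_count):
    """P(A|ti): spam probability of a single word."""
    bad_rate = bad_count / bad_email_count
    good_rate = good_count / good_email_count
    return bad_rate / (good_rate + bad_rate)

def combined_spam_probability(probs):
    """P(A|t1, ..., tr) from the per-word probabilities P1, ..., Pr."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p        # P1 * P2 * ... * Pr
        ham *= 1.0 - p   # (1-P1) * (1-P2) * ... * (1-Pr)
    return spam / (spam + ham)

# A word found in 3 of 10 spam samples and 1 of 10 normal samples
p_word = word_spam_probability(3, 1, 10, 10)
# Two such words occurring in the same mail
p_mail = combined_spam_probability([p_word, p_word])
```

With these inputs the word probability is about 0.75 and the combined probability about 0.9, showing how several moderately suspicious words reinforce one another.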
The process of training a Bayesian filter and the process by which the Bayesian filter calculates the spam probability of a mail have been described above.
The fourth embodiment of the invention further provides a mail recognition device. Referring to Fig. 7, the device comprises:
Feature extraction unit 701, used to obtain the external manifestation characteristic value of a mail;
Type judging unit 702, used to judge the type of the mail according to the external manifestation characteristic value of the mail;
Mailing List unit 703, used to send the mail to the Bayesian filter corresponding to that type;
Bayesian filter 704, used to calculate the spam probability P of the mail;
Mail recognition unit 705, used to judge, according to the spam probability P, whether the mail is spam and, if so, to label the mail as spam.
In the embodiments of the invention, the sample mails are classified in advance according to the k-means algorithm to obtain the initial types of mail. The device therefore further comprises:
Center mail selection unit 706, used to obtain the external manifestation characteristic values of the sample mails and, according to these values and using the k-means algorithm, to select n center mails from the sample mails, each center mail corresponding to one class of sample mails;
Bayesian filter training unit 707, used to train a Bayesian filter with the class of sample mails corresponding to each center mail as the sample library;
wherein the sample mails comprise mails marked as spam and non-spam mails.
The fifth embodiment of the invention further provides a mail recognition device. Referring to Fig. 8, this embodiment differs from the fourth embodiment only in that:
Type judging unit 802 specifically comprises a first sequencing unit 8021 and a first selection unit 8022;
First sequencing unit 8021 is used to calculate, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails, and to sort the distances from the mail to the n predetermined center mails;
First selection unit 8022 is used to select the center mail with the shortest distance to the mail;
Mailing List unit 803 sends the mail to the Bayesian filter corresponding to the center mail selected by first selection unit 8022, which calculates the spam probability of the mail;
Mail recognition unit 805 is used to compare the spam probability with a predetermined first threshold and, if the spam probability is higher than the first threshold, to label the mail as spam.
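The flow of this embodiment (nearest center mail, one Bayesian filter, one threshold) might look like the sketch below. The names are hypothetical, Euclidean distance is an assumption, and each filter is modelled as a plain callable that returns a spam probability for a mail text.

```python
import math

def nearest_center(mail_features, centers):
    """Return the name of the center mail with the shortest distance L."""
    return min(centers,
               key=lambda name: math.dist(mail_features, centers[name]))

def identify(mail_features, mail_text, centers, filters, first_threshold):
    """Route the mail to the filter of its nearest center and apply the threshold."""
    center = nearest_center(mail_features, centers)
    spam_probability = filters[center](mail_text)
    return center, spam_probability > first_threshold

# Hypothetical center mails and stub filters with fixed outputs
centers = {"c": (100.0, 1.0), "f": (5000.0, 2.0)}
filters = {"c": lambda text: 0.95, "f": lambda text: 0.10}
result_short = identify((120.0, 1.0), "hello", centers, filters, 0.9)
result_long = identify((4800.0, 2.0), "hello", centers, filters, 0.9)
```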
The sixth embodiment of the invention further provides a mail recognition device. Referring to Fig. 9, this embodiment differs from the fourth and fifth embodiments only in that:
Type judging unit 902 specifically comprises a second sequencing unit 9021 and a second selection unit 9022;
Second sequencing unit 9021 is used to calculate, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails, and to sort the distances from the mail to the n predetermined center mails;
Second selection unit 9022 is used to select at least two center mails in ascending order of distance.
Mailing List unit 903 is specifically used to send the mail to the Bayesian filters 904 corresponding to the selected i center mails, obtaining the spam probabilities P1, P2, ..., Pi of the mail;
Mail recognition unit 905 specifically comprises a first distance probability calculation unit 9051 and a first spam recognition unit 9052;
First distance probability calculation unit 9051 is used to calculate, from the distances L from the mail to the selected i center mails and according to a preset distance probability calculation rule, the distance probabilities Q1, Q2, ..., Qi from the mail to the selected center mails, where Q1 + Q2 + ... + Qi = 1;
First spam recognition unit 9052 is used to calculate the weighted spam probability of the mail according to:
weighted spam probability = P1*Q1 + P2*Q2 + ... + Pi*Qi,
to compare the weighted spam probability with a preset second threshold and, if it is higher than the second threshold, to label the mail as spam.
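The weighting in this embodiment can be sketched as follows. The distance probability rule used here is the one stated in claim 8 (each reciprocal distance divided by the sum of the reciprocals); the function names, the Euclidean distance, and the assumption that no distance is exactly zero are illustrative choices, and the per-center spam probabilities are passed in as already computed values.

```python
import math

def distance_probabilities(distances):
    """Q values: reciprocal of each distance divided by the sum of reciprocals."""
    inverses = [1.0 / d for d in distances]  # assumes every distance is nonzero
    total = sum(inverses)
    return [inv / total for inv in inverses]

def weighted_spam_probability(mail_features, centers, spam_probs, i=2):
    """P1*Q1 + ... + Pi*Qi over the i center mails nearest to the mail."""
    ranked = sorted(centers,
                    key=lambda name: math.dist(mail_features, centers[name]))[:i]
    q = distance_probabilities(
        [math.dist(mail_features, centers[name]) for name in ranked])
    return sum(spam_probs[name] * qn for name, qn in zip(ranked, q))

# Hypothetical center mails and filter outputs
centers = {"c1": (0.0, 0.0), "c2": (4.0, 0.0), "c3": (100.0, 0.0)}
spam_probs = {"c1": 0.8, "c2": 0.4, "c3": 0.1}
weighted = weighted_spam_probability((1.0, 0.0), centers, spam_probs, i=2)
```

Here the mail is at distance 1 from c1 and 3 from c2, so Q1 = 0.75 and Q2 = 0.25, and the closer center's filter carries three times the weight.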
The seventh embodiment of the invention further provides a mail recognition device. Referring to Figure 10, this embodiment differs from the fourth, fifth and sixth embodiments only in that:
Type judging unit 112 specifically comprises a second distance probability calculation unit 1121 and a third selection unit 1122;
Second distance probability calculation unit 1121 is used to calculate, from the distances L from the mail to the n predetermined center mails and according to a predetermined distance probability calculation rule, the distance probabilities Q1, Q2, ..., Qn from the mail to the n predetermined center mails, where Q1 + Q2 + ... + Qn = 1;
Third selection unit 1122 is used to sort the distance probabilities, add them one by one in descending order, and select the center mails whose distance probabilities make the running sum first exceed a predetermined distance probability threshold;
Mailing List unit 113 is used to send the mail to the Bayesian filters 114 corresponding to the center mails selected by third selection unit 1122, obtaining the spam probabilities P1, P2, ..., Pj of the mail, where j is the number of selected center mails;
Mail recognition unit 115 is specifically used to calculate the weighted spam probability of the mail according to:
weighted spam probability = P1*Q1 + P2*Q2 + ... + Pj*Qj,
to compare the weighted spam probability with a predetermined threshold and, if it is greater than the predetermined threshold, to label the mail as spam, where j is the number of selected center mails.
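The selection rule of this embodiment can be illustrated with a short sketch. The names are hypothetical, and the weighted sum uses the Q values exactly as the formula is written, without renormalising them over the selected center mails, since the source does not mention renormalisation.

```python
def select_by_cumulative_probability(distance_probs, threshold):
    """Take center mails in descending Q until the running sum first exceeds threshold."""
    ranked = sorted(distance_probs.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = [], 0.0
    for name, q in ranked:
        selected.append((name, q))
        total += q
        if total > threshold:
            break
    return selected

def weighted_probability(selected, spam_probs):
    """P1*Q1 + ... + Pj*Qj over the j selected center mails."""
    return sum(spam_probs[name] * q for name, q in selected)

# Hypothetical distance probabilities for three center mails (sum to 1)
distance_probs = {"c": 0.6, "f": 0.3, "j": 0.1}
selected = select_by_cumulative_probability(distance_probs, 0.8)
weighted = weighted_probability(selected, {"c": 0.9, "f": 0.5, "j": 0.2})
```

With a distance probability threshold of 0.8, mail c alone (0.6) is not enough, so mail f is added and the remote center mail j is never consulted.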
The mail identifying method and device provided by the embodiments of the invention have been described in detail above.
To improve the accuracy of mail identification, the technical scheme provided by the embodiments of the invention further covers the case where the recognition result is wrong: for example, the method or device labels a mail as spam and delivers it to the user's spam inbox, and the user, after reading the mail, marks it as non-spam; or the user marks a mail identified as non-spam as spam. In such cases, the scheme further comprises: segmenting the content and/or subject of the wrongly identified mail into words and retraining the Bayesian filter.
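One possible reading of this retraining step is to add the corrected mail to the sample library under the label chosen by the user. The sketch below assumes exactly that; the library layout, the whitespace word splitting, and the function name are hypothetical, and the source does not specify how retraining is carried out.

```python
def retrain_with_correction(library, text, user_says_spam):
    """Add a wrongly identified mail to the sample library under the user's label."""
    word_key = "BadCount" if user_says_spam else "GoodCount"
    mail_key = "BadEmailCount" if user_says_spam else "GoodEmailCount"
    library[mail_key] += 1
    for word in set(text.lower().split()):  # stand-in for word segmentation
        record = library["words"].setdefault(
            word, {"BadCount": 0, "GoodCount": 0})
        record[word_key] += 1
    return library

# Hypothetical library in which "cheap" has only been seen in spam
library = {"words": {"cheap": {"BadCount": 1, "GoodCount": 0}},
           "BadEmailCount": 1, "GoodEmailCount": 0}
# The user marks a mail flagged as spam as non-spam
retrain_with_correction(library, "cheap tickets", user_says_spam=False)
```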
The mail identifying method and device provided by the present invention have been described in detail above. For a person of ordinary skill in the art, the specific embodiments and the scope of application may be varied in accordance with the idea of the embodiments of the invention. In summary, this description should not be construed as limiting the present invention.

Claims (16)

1. A mail identifying method, characterized by comprising:
obtaining the external manifestation characteristic value of a mail;
judging the type of the mail according to the external manifestation characteristic value of the mail;
sending the mail to the Bayesian filter corresponding to the type to identify whether the mail is spam.
2. The method of claim 1, characterized in that the method further comprises:
obtaining the external manifestation characteristic values of sample mails;
selecting, according to the external manifestation characteristic values and using the k-means algorithm, n center mails from the sample mails, each center mail corresponding to one class of sample mails;
training a Bayesian filter with the class of sample mails corresponding to each center mail as the sample library;
wherein the sample mails comprise mails marked as spam and non-spam mails.
3. The method of claim 2, characterized in that judging the type of the mail according to the external manifestation characteristic value of the mail specifically comprises:
calculating, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails;
sorting the distances from the mail to the n predetermined center mails, and selecting the center mail with the shortest distance to the mail.
4. The method of claim 2, characterized in that judging the type of the mail according to the external manifestation characteristic value of the mail specifically comprises:
calculating, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails;
sorting the distances from the mail to the n predetermined center mails, and selecting i center mails in ascending order of distance, i being an integer greater than or equal to 2.
5. The method of claim 4, characterized in that sending the mail to the Bayesian filter corresponding to the type to identify whether the mail is spam specifically comprises:
calculating, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails;
calculating, from the distances L from the mail to the selected i center mails, the distance probabilities Q1, Q2, ..., Qi from the mail to the selected center mails, where Q1 + Q2 + ... + Qi = 1;
sending the mail to the Bayesian filters corresponding to the selected i center mails to obtain the spam probabilities P1, P2, ..., Pi of the mail;
calculating the weighted spam probability of the mail according to: weighted spam probability = P1*Q1 + P2*Q2 + ... + Pi*Qi, comparing the weighted spam probability with a preset second threshold and, if it is higher than the second threshold, labelling the mail as spam.
6. The method of claim 2, characterized in that judging the type of the mail according to the external manifestation characteristic value of the mail specifically comprises:
calculating, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails;
calculating, from the distances L from the mail to the n predetermined center mails, the distance probabilities Q1, Q2, ..., Qn between the mail and the n predetermined center mails, where Q1 + Q2 + ... + Qn = 1;
sorting the distance probabilities, adding them one by one in descending order, and selecting the center mails whose distance probabilities make the running sum first exceed a predetermined distance probability threshold.
7. The method of claim 6, characterized in that sending the mail to the Bayesian filter corresponding to the type to identify whether the mail is spam specifically comprises:
sending the mail to the Bayesian filters corresponding to the selected center mails to obtain the spam probabilities P1, P2, ..., Pj of the mail, j being the number of selected center mails;
calculating the weighted spam probability of the mail according to: weighted spam probability = P1*Q1 + P2*Q2 + ... + Pj*Qj, comparing the weighted spam probability with a predetermined third threshold and, if it is greater than the predetermined third threshold, labelling the mail as spam.
8. The method of claim 5 or 7, characterized in that calculating, from the distances L from the mail to the predetermined center mails, the distance probabilities between the mail and the predetermined center mails specifically comprises:
calculating the reciprocal of the distance L from the mail to each center mail, and dividing each reciprocal by the sum of the reciprocals of the distances L from the mail to the center mails.
9. The method of claim 8, characterized in that, if the user changes the recognition result of the mail, the method further comprises:
segmenting the content and/or subject of the mail into words and retraining the Bayesian filter.
10. A mail recognition device, characterized by comprising:
a feature extraction unit, used to obtain the external manifestation characteristic value of a mail;
a type judging unit, used to judge the type of the mail according to the external manifestation characteristic value of the mail;
a Mailing List unit, used to send the mail to the Bayesian filter corresponding to the type;
a Bayesian filter, used to calculate the spam probability P of the mail;
a mail recognition unit, used to judge, according to the spam probability P, whether the mail is spam and, if so, to label the mail as spam.
11. The device of claim 10, characterized in that the device further comprises:
a center mail selection unit, used to obtain the external manifestation characteristic values of sample mails and, according to these values and using the k-means algorithm, to select the n center mails from the sample mails, each center mail corresponding to one class of sample mails;
a Bayesian filter training unit, used to train a Bayesian filter with the class of sample mails corresponding to each center mail as the sample library;
wherein the sample mails comprise mails marked as spam and non-spam mails.
12. The device of claim 11, characterized in that the type judging unit specifically comprises a first sequencing unit and a first selection unit;
the first sequencing unit is used to calculate, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails, and to sort the distances from the mail to the n predetermined center mails;
the first selection unit is used to select the center mail with the shortest distance to the mail.
13. The device of claim 11, characterized in that the type judging unit specifically comprises a second sequencing unit and a second selection unit;
the second sequencing unit is used to calculate, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails, and to sort the distances from the mail to the n predetermined center mails;
the second selection unit is used to select at least two center mails in ascending order of distance.
14. The device of claim 13, characterized in that
the Mailing List unit is specifically used to send the mail to the Bayesian filters corresponding to the selected i center mails, obtaining the spam probabilities P1, P2, ..., Pi of the mail;
the mail recognition unit specifically comprises a first distance probability calculation unit and a first spam recognition unit;
the first distance probability calculation unit is used to calculate, from the distances L from the mail to the selected i center mails and according to a preset distance probability calculation rule, the distance probabilities Q1, Q2, ..., Qi from the mail to the selected center mails, where Q1 + Q2 + ... + Qi = 1;
the first spam recognition unit is used to calculate the weighted spam probability of the mail according to:
weighted spam probability = P1*Q1 + P2*Q2 + ... + Pi*Qi, to compare the weighted spam probability with a preset second threshold and, if it is higher than the second threshold, to label the mail as spam.
15. The device of claim 11, characterized in that the type judging unit specifically comprises a second distance probability calculation unit and a third selection unit;
the second distance probability calculation unit is used to calculate, according to the external manifestation characteristic value, the distance L from the mail to each of the n predetermined center mails and, from these distances L and according to a predetermined distance probability calculation rule, to calculate the distance probabilities Q1, Q2, ..., Qn from the mail to the n predetermined center mails, where Q1 + Q2 + ... + Qn = 1;
the third selection unit is used to sort the distance probabilities, add them one by one in descending order, and select the center mails whose distance probabilities make the running sum first exceed a predetermined distance probability threshold.
16. The device of claim 15 or 16, characterized in that
the mail recognition unit is specifically used to calculate the weighted spam probability of the mail according to:
weighted spam probability = P1*Q1 + P2*Q2 + ... + Pj*Qj, to compare the weighted spam probability with a predetermined threshold and, if it is greater than the predetermined threshold, to label the mail as spam, j being the number of selected center mails.
CN2007101546412A 2007-09-20 2007-09-20 Mail identifying method and apparatus Active CN101119341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101546412A CN101119341B (en) 2007-09-20 2007-09-20 Mail identifying method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101546412A CN101119341B (en) 2007-09-20 2007-09-20 Mail identifying method and apparatus

Publications (2)

Publication Number Publication Date
CN101119341A true CN101119341A (en) 2008-02-06
CN101119341B CN101119341B (en) 2011-02-16

Family

ID=39055279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101546412A Active CN101119341B (en) 2007-09-20 2007-09-20 Mail identifying method and apparatus

Country Status (1)

Country Link
CN (1) CN101119341B (en)

Also Published As

Publication number Publication date
CN101119341B (en) 2011-02-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant