CN103001848A - Spam filtering method and spam filtering device - Google Patents

Publication number
CN103001848A
Authority
CN
China
Prior art keywords
mail
email
situation
spam
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102643651A
Other languages
Chinese (zh)
Other versions
CN103001848B (en
Inventor
郭涛
于洪涌
薛立宏
丘凌
张国威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN201110264365.1A
Publication of CN103001848A
Application granted
Publication of CN103001848B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a spam filtering method. When an email is received, its content is scanned for fuzzy words that hit entries in a preset fuzzy word and context recognition library. If a hit entry exists, context analysis is performed on the email, and the mail value vector of the email is obtained by adjustment according to the context corresponding to the email. A spam probability is then computed from the mail value vector and compared with a preset threshold to judge whether the email is suspicious spam, and emails determined to be suspicious spam are intercepted. The invention further relates to a spam filtering device. Because interception is based on fuzzy word recognition and context analysis, spam containing fuzzy words is intercepted, the interception range is widened considerably while filtering accuracy is maintained, and existing spam interception approaches that rely mainly on keyword filtering are supplemented and optimized.

Description

Spam filtering method and device
Technical field
The present invention relates to anti-spam technology, and in particular to a spam filtering method and device.
Background art
Junk email (spam for short) refers to any email forcibly sent to a user's mailbox without the user's permission. Email is one of the basic applications of today's Internet users, and spam is sent mainly through email accounts. Monitoring data from December 2010 showed that about 50 billion spam messages were sent worldwide each day. Spam content includes promotional advertising, adult advertising and money-making schemes, as well as destructive emails carrying computer viruses, and it causes considerable trouble for email users. Major mail providers therefore treat improving the effectiveness of their anti-spam systems as a key part of improving the mailbox user experience.
Commonly used anti-spam systems filter with a predefined keyword technique: a keyword list is defined in advance, content extracted from passing mail is compared against the list, and a hit triggers the corresponding spam interception action. Although this plain keyword list matching is simple to implement, spammers can easily evade it by inserting interference characters, using homophones, or using visually similar characters, which causes the spam filtering system to fail.
In addition, a pure keyword filtering scheme is also weak at recognizing normal email. It may intercept some normal email by mistake and thereby interfere with users' normal use of email.
Summary of the invention
The object of the invention is to provide a spam filtering method and device that widen the interception range for spam while maintaining the accuracy of spam filtering.
To achieve this object, the invention provides a spam filtering method, comprising:
When an email is received, scanning whether the content of the email contains a fuzzy word that hits an entry in a preset fuzzy word and context recognition library;
If a hit entry exists, performing context analysis on the email, and adjusting according to the context corresponding to the email to obtain the mail value vector of the email;
Calculating a spam probability from the adjusted mail value vector of the email, comparing the spam probability with a preset threshold to judge whether the email is suspicious spam, and intercepting any email determined to be suspicious spam.
To achieve this object, the invention also provides a spam filtering device, comprising:
An email receiving unit, for receiving email;
A fuzzy word scanning unit, for scanning whether the content of the email contains a fuzzy word that hits an entry in the preset fuzzy word and context recognition library;
A context analysis unit, for performing context analysis on the email when a hit entry exists;
A vector adjustment unit, for adjusting according to the context corresponding to the email to obtain the mail value vector of the email;
A spam probability calculation unit, for calculating the spam probability from the adjusted mail value vector of the email;
A threshold comparison unit, for comparing the spam probability with a preset threshold to judge whether the email is suspicious spam;
A mail processing unit, for intercepting any email determined to be suspicious spam.
Based on the above technical solution, the invention intercepts spam containing fuzzy words on the basis of fuzzy word recognition and context analysis. This greatly widens the interception range for spam while maintaining filtering accuracy, and it further supplements and optimizes existing spam interception approaches that rely mainly on keyword filtering.
Description of drawings
The accompanying drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of one embodiment of the spam filtering method of the invention.
Fig. 2 is a schematic flowchart of establishing the fuzzy word and context recognition library in another embodiment of the spam filtering method of the invention.
Fig. 3 is a schematic flowchart of yet another embodiment of the spam filtering method of the invention.
Fig. 4 is a schematic structural diagram of one embodiment of the spam filtering device of the invention.
Fig. 5 is a schematic structural diagram of the units that implement the library-building process in another embodiment of the spam filtering device of the invention.
Fig. 6 is a schematic structural diagram of yet another embodiment of the spam filtering device of the invention.
Embodiments
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
On top of the existing keyword interception of an anti-spam system, the invention adds a spam sorting method that recognizes fuzzy words (including homophones, visually similar characters, split characters and the like), so that spam disguised with fuzzy words can be intercepted. During recognition, the mail undergoes fuzzy word and context processing that takes into account the interference symbols in the mail, the fuzzy word hits, and auxiliary analysis of the corresponding context. The mail is sorted on the basis of vector operations and probability, and the system is optimized according to the results.
Fig. 1 is a schematic flowchart of one embodiment of the spam filtering method of the invention. In this embodiment, the spam filtering method comprises:
Step 101: when an email is received, scan whether the content of the email contains a fuzzy word that hits an entry in the preset fuzzy word and context recognition library;
Step 102: if a hit entry exists, perform context analysis on the email, and adjust according to the context corresponding to the email to obtain the mail value vector of the email;
Step 103: calculate the spam probability from the adjusted mail value vector of the email;
Step 104: compare the spam probability with a preset threshold to judge whether the email is suspicious spam;
Step 105: intercept any email determined to be suspicious spam.
In this embodiment, when an email is received, the entries in the library are queried first. Each entry in the fuzzy word and context recognition library contains the correspondence between a fuzzy word and an existing spam keyword, the reference mail value vector for that correspondence, and the influence probabilities on that correspondence under a number of contexts. Querying the entries mainly consists of searching the email content for fuzzy words that also appear in the library, which locates the hit entries.
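The entry layout described above can be sketched as a small data structure. This is only an illustration under assumptions: the class names `MailValueVector` and `LibraryEntry`, the function `find_hits`, plain substring matching, and the sample fuzzy word are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class MailValueVector:
    key: float           # keyword score
    exchange: float      # replacement score
    content: float       # context score
    disturb_mark: float  # interference symbol score


@dataclass
class LibraryEntry:
    fuzzy_word: str                    # disguised form found in mail
    spam_keyword: str                  # existing spam keyword it stands for
    reference_vector: MailValueVector  # reference mail value vector
    # hypothetical layout: context name -> influence probability
    context_influence: Dict[str, float] = field(default_factory=dict)


def find_hits(body: str, library: List[LibraryEntry]) -> List[LibraryEntry]:
    """Return the entries whose fuzzy word occurs in the email body."""
    return [entry for entry in library if entry.fuzzy_word in body]
```

A lookup then reduces to `find_hits(body, library)`. An empty result means no entry was hit.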
Once an entry has been hit, context analysis is performed on the email. This analysis may comprise: analyzing the email to obtain its context elements, matching those context elements against the context elements of each context contained in the hit entry, and determining the context that corresponds to the email.
The context elements here may include the sending time of the mail, certain words contained in the mail content, the sender's mailbox domain name, and so on, but are not limited to these examples. Combinations of context elements express different contexts, and depending on the context, the probability that an email containing a given fuzzy word is spam rises or falls accordingly. For example, suppose analysis shows that an email was sent around the Mid-Autumn Festival, that the word "recovery" (in the sense of buying back) appears in it, and that the fuzzy word of the hit entry is a disguised form of "mooncake" (rendered literally as "moon also"). These context elements largely identify a mooncake buy-back scenario around the Mid-Autumn Festival, which is regarded as a category of abnormal mail, so the probability that the email is spam increases.
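The matching step can be illustrated with a short sketch. It assumes that a context is a set of required context elements, that a context matches when all of its elements were found in the email, and that the most specific match wins. None of these rules is stated in the patent, and the function name `determine_context` and the element labels are hypothetical.

```python
from typing import Dict, Optional, Set


def determine_context(email_elements: Set[str],
                      contexts: Dict[str, Set[str]]) -> Optional[str]:
    """Return the most specific context whose elements all occur in the email."""
    best: Optional[str] = None
    best_size = 0
    for name, required in contexts.items():
        # A context matches only if every one of its elements was found.
        if required and required <= email_elements and len(required) > best_size:
            best, best_size = name, len(required)
    return best


# Hypothetical labels for the Mid-Autumn Festival example in the text.
contexts = {"mooncake_buyback": {"time:mid_autumn", "word:buy_back"}}
found = {"time:mid_autumn", "word:buy_back", "domain:example.com"}
print(determine_context(found, contexts))
```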
Once the context corresponding to the email has been identified, the influence probability of that context in the hit entry is used for the adjustment. Specifically, the influence probability is looked up according to the context determined by the context analysis, and the reference mail value vector of the hit entry is adjusted by that influence probability to obtain the mail value vector of the email. The mail value vector of the email comprises a keyword score, a replacement score, a context score and an interference symbol score. The adjustment is applied to the reference mail value vector of the hit entry, and the adjusted vector serves as the mail value vector of the email.
After the mail value vector of the email is obtained, the spam probability is calculated from it. The calculation adds the product of the keyword score and the replacement score in the email's mail value vector to the context score and the interference symbol score, which gives the spam probability of the email. Practitioners may tune the variables in the context adjustment formula, or use a new formula, according to how well the calculated results match actual spam. The formula is given only for illustration and does not limit the scope of protection.
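The calculation described above amounts to Key × eXchange + Content + DisturbMark. A minimal sketch follows. The function name and the sample values are hypothetical.

```python
def spam_probability(key: float, exchange: float,
                     content: float, disturb_mark: float) -> float:
    """Product of keyword and replacement scores, plus context and interference scores."""
    return key * exchange + content + disturb_mark


# Sample values chosen only for illustration.
p = spam_probability(key=0.8, exchange=0.9, content=0.1, disturb_mark=0.05)
print(round(p, 2))
```

Because two terms are added to a product, the result is not guaranteed to stay within [0, 1]. The text does not say whether the value is clamped, so the sketch leaves it as computed.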
After the spam probability is calculated, it is compared with the preset threshold to judge whether the email is suspicious spam. For example, if the calculated spam probability is greater than the preset threshold, the email is determined to be suspicious spam; if it is less than or equal to the threshold, the email is cleared of spam suspicion and can be delivered normally. Alternatively, the email may be determined to be suspicious spam when the calculated spam probability is greater than or equal to the preset threshold, and cleared of suspicion and delivered normally when it is less than the threshold.
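The two comparison variants differ only in how a probability exactly equal to the threshold is treated. A small sketch makes the difference explicit. The function name and the `inclusive` flag are hypothetical.

```python
def is_suspicious(probability: float, threshold: float,
                  inclusive: bool = False) -> bool:
    """Return True if the email should be treated as suspicious spam.

    inclusive=False: suspicious only when probability > threshold.
    inclusive=True:  suspicious when probability >= threshold.
    """
    if inclusive:
        return probability >= threshold
    return probability > threshold
```

With a threshold of 0.5, a probability of exactly 0.5 is delivered under the first variant and intercepted under the second.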
Fig. 2 is a schematic flowchart of establishing the fuzzy word and context recognition library in another embodiment of the spam filtering method of the invention. Compared with the previous embodiment, this embodiment additionally establishes the fuzzy word and context recognition library before step 101, which comprises:
Step 201: establish the fuzzy word and context recognition library;
Step 202: add entries to the fuzzy word and context recognition library according to the correspondence between existing spam keywords and fuzzy words;
Step 203: calculate, from historical data in the anti-spam system, the reference mail value vector for the correspondence between the existing spam keyword and the fuzzy word, the reference mail value vector comprising a reference keyword score, a reference replacement score, a reference context score and a reference interference symbol score;
Step 204: add to the entry the influence probabilities on the correspondence between the existing spam keyword and the fuzzy word under multiple contexts, each context comprising at least one context element.
In this embodiment, each entry in the fuzzy word and context recognition library comprises the correspondence between a fuzzy word and an existing spam keyword, the reference mail value vector for that correspondence, and the influence probabilities on that correspondence under multiple contexts. The reference mail value vector is calculated from the historical data of the anti-spam system and reflects how the fuzzy word and the existing spam keyword correspond in the data available so far.
The reference mail value vector eMailValue0 comprises the reference keyword score Key0, the reference replacement score eXchange0, the reference context score Content0 and the reference interference symbol score DisturbMark0. Key0 is the average probability that a mail is spam when it contains the existing spam keyword. eXchange0 is the average probability that, when the fuzzy word appears in a mail, it is actually standing in for the existing spam keyword. Content0 is the average influence of the mail's context on the spam probability when a fuzzy word replaces the keyword in the mail; this value may be positive or negative. DisturbMark0 is the average increase in the probability that the mail is spam if, when a fuzzy word replaces the keyword, the mail also contains interference symbols.
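The text says only that the reference scores are calculated from historical data, not how. One plausible reading of Key0, shown here purely as an assumption, is the fraction of historical mails containing the keyword that were labeled spam. The function name and the `(body, is_spam)` history format are hypothetical.

```python
from typing import List, Tuple


def reference_keyword_score(history: List[Tuple[str, bool]], keyword: str) -> float:
    """Estimate Key0 as the share of keyword-bearing historical mails labeled spam."""
    labels = [is_spam for body, is_spam in history if keyword in body]
    if not labels:
        return 0.0  # no evidence yet; the fallback value is an arbitrary choice
    return sum(labels) / len(labels)


history = [("buy viagra today", True), ("viagra clinical study", False), ("hi", False)]
print(reference_keyword_score(history, "viagra"))
```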
When the reference mail value vector is adjusted to obtain the email's mail value vector, the reference keyword score Key0 and the reference replacement score eXchange0 are determined by historical data and need no adjustment. In other words, Key0 and eXchange0 are used directly as the keyword score Key and the replacement score eXchange in the email's mail value vector. The context score, by contrast, is obtained by adjusting the reference context score Content0 according to the influence probability Pc of the context identified by the analysis, and the adjusted result serves as the context score Content in the email's mail value vector.
As noted above, the influence probability Pc may be positive or negative. Suppose a positive value expresses that, in this context, a mail in which a fuzzy word replaces the keyword is more likely to be spam, and a negative value expresses that it is less likely. Content0 can then be adjusted by simple arithmetic, for example by adding Pc, or a multiple of Pc, directly to Content0 to obtain the adjusted Content.
Another convention is equally feasible: a larger value expresses that, in this context, a mail in which a fuzzy word replaces the keyword is more likely to be spam, and a smaller value expresses that it is less likely. A suitable calculation must then be chosen to adjust Content0, for example adding the difference between Pc and some constant to Content0 to obtain the adjusted Content. The adjustment examples given here are only intended to aid understanding and do not limit the adjustment to any particular method.
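Both adjustment conventions can be written in a few lines. This is a sketch of the examples given in the text rather than a prescribed formula. The function names, the `multiplier` parameter and the `baseline` constant of 0.5 are assumptions chosen for illustration.

```python
def adjust_content_signed(content0: float, pc: float,
                          multiplier: float = 1.0) -> float:
    """Signed convention: Pc > 0 raises the context score, Pc < 0 lowers it."""
    return content0 + multiplier * pc


def adjust_content_offset(content0: float, pc: float,
                          baseline: float = 0.5) -> float:
    """Magnitude convention: Pc above the baseline raises the score, below lowers it."""
    return content0 + (pc - baseline)
```

Under the signed convention, `adjust_content_signed(0.1, -0.05)` lowers the context score. Under the magnitude convention, `adjust_content_offset(0.1, 0.7)` raises it because 0.7 lies above the assumed baseline.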
In each of the above embodiments, interference symbols in an email can prevent the fuzzy words in it from matching entries in the fuzzy word and context recognition library. Interference symbols here usually mean non-linguistic symbols inserted into words, such as typesetting characters (line breaks, "|", "/", "#", "!" and the like) and also spaces, which make the word hard for a computer to recognize. To improve the recognition rate, after the email is received and before its content is scanned for fuzzy words that hit entries in the preset fuzzy word and context recognition library, the non-linguistic parts of the email are also subjected to interference symbol removal. Deleting these non-linguistic symbols makes the words in the mail more coherent and thereby improves the recognition rate.
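Interference symbol removal can be sketched with a regular expression. The symbol set below covers only the examples named in the text (whitespace including line breaks, "|", "/", "#", "!" and its full-width form) and is an assumption rather than the patent's full list. The function name is hypothetical.

```python
import re

# Whitespace (including line breaks) plus the example symbols from the text.
_INTERFERENCE = re.compile(r"[\s|/#!！]+")


def strip_interference(text: str) -> str:
    """Delete interference symbols so that split words become contiguous again."""
    return _INTERFERENCE.sub("", text)


print(strip_interference("v|i/a#g!r a\n"))  # prints "viagra"
```

Removing every space suits text written without word spacing, such as Chinese. For space-delimited languages a real system would need a gentler rule, since this sketch also merges adjacent words.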
In addition, if interference symbols appear in many places in a mail, the mail is more likely to be spam. When the mail value vector is adjusted, the reference interference symbol score DisturbMark0 can therefore be adjusted according to factors such as whether interference symbols are present, how frequently they occur and how many times they occur, so that the finally calculated spam probability reflects the influence of the interference symbols.
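The text names the factors (presence, frequency, count) but gives no formula. The sketch below is one hypothetical rule: no symbols gives a score of zero, one symbol gives DisturbMark0, and each further symbol adds a small capped bonus. The function name and the `per_symbol` and `cap` parameters are assumptions.

```python
def adjust_disturb_mark(disturb_mark0: float, symbol_count: int,
                        per_symbol: float = 0.01, cap: float = 0.2) -> float:
    """Scale the interference symbol score by how many symbols were found."""
    if symbol_count <= 0:
        return 0.0  # no interference symbols, so no contribution
    bonus = min(cap, per_symbol * (symbol_count - 1))
    return disturb_mark0 + bonus
```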
After an email determined to be suspicious spam has been intercepted, the judgment result is also combined with the historical data to recalculate the reference mail value vector for the correspondence between the existing spam keyword and the fuzzy word, and the influence probabilities of the contexts, and the corresponding entry in the fuzzy word and context recognition library is updated. Entries are thus updated continuously as spam is filtered, so that subsequent filtering better matches actual conditions and its results are more accurate.
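How entries are recalculated is not specified. One simple possibility, shown purely as an assumption, is to keep each score as a running mean and fold in every new verdict without re-reading the full history. The class name `RunningMean` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Incrementally maintained average of observed values."""
    mean: float = 0.0
    count: int = 0

    def update(self, value: float) -> float:
        # Incremental mean: new_mean = old_mean + (value - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# A score previously observed once at 0.5, updated with a confirmed spam verdict (1.0).
key0 = RunningMean(mean=0.5, count=1)
print(key0.update(1.0))  # prints 0.75
```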
Fig. 3 is a schematic flowchart of yet another embodiment of the spam filtering method of the invention. This embodiment gives a more specific spam filtering example, comprising:
Step 301: receive an email;
Step 302: perform interference symbol removal on the non-linguistic parts of the email;
Step 303: scan whether the content of the email contains a fuzzy word that hits an entry in the preset fuzzy word and context recognition library; if so, go to step 304, otherwise go to step 311;
Step 304: analyze the email to obtain its context elements;
Step 305: match the context elements obtained from the email against the context elements of each context in the hit entry, and determine the context corresponding to the email;
Step 306: look up the influence probability according to the context determined by the context analysis;
Step 307: adjust the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the email;
Step 308: add the product of the keyword score and the replacement score in the email's mail value vector to the context score and the interference symbol score to obtain the spam probability of the email;
Step 309: compare the resulting spam probability with the preset threshold to judge whether the email is suspicious spam; if so, go to step 310, otherwise go to step 311;
Step 310: intercept the email determined to be suspicious spam, and go to step 312;
Step 311: deliver the email as normal email;
Step 312: according to the processing result of this email, optimize and adjust the entries in the fuzzy word and context recognition library and the threshold used for the spam probability judgment.
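The flow of Fig. 3 can be sketched end to end as follows. This is a simplified illustration, not the patented implementation: the `Entry` layout, the noise pattern, the rule that the first fully matched context supplies Pc, the additive context adjustment, and the all-or-nothing interference score are assumptions. Mail without any hit is delivered, and the optimization of step 312 is omitted.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

NOISE = re.compile(r"[\s|/#!！]+")  # assumed interference symbol set


@dataclass
class Entry:
    fuzzy_word: str
    key0: float
    exchange0: float
    content0: float
    disturb_mark0: float
    # hypothetical layout: context name -> (required elements, Pc)
    contexts: Dict[str, Tuple[Set[str], float]] = field(default_factory=dict)


def filter_email(body: str, elements: Set[str],
                 library: List[Entry], threshold: float) -> str:
    """Return 'intercept' or 'deliver' for one email."""
    cleaned, removed = NOISE.subn("", body)              # step 302
    for entry in library:                                # step 303
        if entry.fuzzy_word not in cleaned:
            continue
        pc = 0.0                                         # steps 304 to 306
        for required, influence in entry.contexts.values():
            if required <= elements:
                pc = influence
                break
        content = entry.content0 + pc                    # step 307
        disturb = entry.disturb_mark0 if removed else 0.0
        p = entry.key0 * entry.exchange0 + content + disturb  # step 308
        if p > threshold:                                # step 309
            return "intercept"                           # step 310
    return "deliver"                                     # step 311


library = [Entry("v1agra", 0.8, 0.5, 0.0, 0.1,
                 {"holiday_promo": ({"time:holiday"}, 0.2)})]
print(filter_email("v1|ag#ra sale", {"time:holiday"}, library, 0.6))
```

In this example the same message is delivered when the holiday context element is absent, because the context adjustment no longer lifts the probability above the threshold.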
In each of the above embodiments, if there are multiple hit entries, the fuzzy word processing, spam probability calculation and threshold comparison can be carried out for each hit entry separately, and the conclusions obtained for the individual hit entries are then combined to judge whether the email is suspicious spam.
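The text does not say how the per-entry conclusions are combined. Two plausible policies, both assumptions, are "any entry exceeds the threshold" and "a majority of entries exceed the threshold". The function name and the policy labels are hypothetical.

```python
from typing import List


def combine_verdicts(probabilities: List[float], threshold: float,
                     policy: str = "any") -> bool:
    """Combine per-entry spam probabilities into one suspicious or not verdict."""
    flags = [p > threshold for p in probabilities]
    if policy == "any":
        return any(flags)
    if policy == "majority":
        return sum(flags) * 2 > len(flags)
    raise ValueError(f"unknown policy: {policy}")
```

The "any" policy favors a wide interception range, while "majority" is more conservative and reduces the chance of intercepting normal mail.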
Embodiments of the invention can supplement keyword interception. For example, the judgment can be made in parallel with keyword interception and the combined conclusions judged again, or spam containing fuzzy words can be identified only after keyword interception has failed to identify a mail as spam. Either way, the interception range of the anti-spam system is enlarged.
With the invention, mail that tries to evade filtering by inserting interference characters, using homophones or using visually similar characters can no longer slip through easily, which widens the interception range of the anti-spam system. At the same time, context analysis adds a context dimension to the discrimination of spam, which improves spam sorting precision and avoids misjudging mail as spam. In some embodiments, feedback adjustment after mail processing is also used so that the spam interception system continuously optimizes itself and does not become too rigid to adapt to new variations of spam.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.
Fig. 4 is a schematic structural diagram of one embodiment of the spam filtering device of the invention. In this embodiment, the spam filtering device comprises an email receiving unit 11, a fuzzy word scanning unit 12, a context analysis unit 13, a vector adjustment unit 14, a probability calculation unit 15, a threshold comparison unit 16 and a mail processing unit 17.
The email receiving unit 11 receives email. The fuzzy word scanning unit 12 scans whether the content of the email contains a fuzzy word that hits an entry in the preset fuzzy word and context recognition library. The context analysis unit 13 performs context analysis on the email when a hit entry exists. The vector adjustment unit 14 adjusts according to the context corresponding to the email to obtain the mail value vector of the email.
The probability calculation unit 15 calculates the spam probability from the adjusted mail value vector of the email. The threshold comparison unit 16 compares the spam probability with the preset threshold to judge whether the email is suspicious spam. The mail processing unit 17 intercepts any email determined to be suspicious spam.
Fig. 5 is a schematic structural diagram of the units that implement the library-building process in another embodiment of the spam filtering device of the invention. Compared with the previous embodiment, this embodiment adds the units that implement the library-building process: a library building unit 21, an entry adding unit 22, a reference vector calculation unit 23 and a context probability adding unit 24. The library building unit 21 establishes the fuzzy word and context recognition library. The entry adding unit 22 adds entries to the fuzzy word and context recognition library according to the correspondence between existing spam keywords and fuzzy words. The reference vector calculation unit 23 calculates, from historical data in the anti-spam system, the reference mail value vector for the correspondence between the existing spam keyword and the fuzzy word, the reference mail value vector comprising a reference keyword score, a reference replacement score, a reference context score and a reference interference symbol score. The context probability adding unit 24 adds to the entry the influence probabilities on the correspondence between the existing spam keyword and the fuzzy word under multiple contexts, each context comprising at least one context element.
Fig. 6 is a schematic structural diagram of yet another embodiment of the spam filtering device of the invention. Compared with the preceding device embodiments, this embodiment may additionally include an interference symbol removal unit 17, connected to the email receiving unit 11 and the fuzzy word scanning unit 12. Before the content of the email is scanned for fuzzy words that hit entries in the preset fuzzy word and context recognition library, this unit performs interference symbol removal on the non-linguistic parts of the email.
The context analysis unit 13 may further comprise a context element analysis component 13a, a context element matching component 13b and a context determination component 13c. The context element analysis component 13a analyzes the email to obtain its context elements. The context element matching component 13b matches the context elements obtained from the email against the context elements of each context in the hit entry. The context determination component 13c determines the context corresponding to the email according to the matching result.
The vector adjustment unit 14 may further comprise an influence probability query component 14a and a vector adjustment component 14b. The influence probability query component 14a looks up the influence probability according to the context determined by the context analysis. The vector adjustment component 14b adjusts the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the email, which comprises a keyword score, a replacement score, a context score and an interference symbol score.
In another embodiment, a library updating unit 18 may also be added to the spam filtering device. This unit is connected to the mail processing unit 17. After the email has been judged to be suspicious spam and intercepted, it combines the judgment result with the historical data to recalculate the reference mail value vector for the correspondence between the existing spam keyword and the fuzzy word, and the influence probabilities of the contexts, and updates the corresponding entry in the fuzzy word and context recognition library.
If there are multiple hit entries, each hit entry is processed separately by the context analysis unit, the vector adjustment unit, the spam probability calculation unit and the threshold comparison unit, and the conclusions obtained for the individual hit entries are then combined to judge whether the email is suspicious spam.
By the present invention, the mail that can make those attempt to disturb character by inserting, to use homophone, uses the mode such as nearly word form to evade filtration is not easy to get by under false pretences, increase the interception scope of anti-garbage mail system, utilize simultaneously scenario analysis to increase the situation dimension for the differentiation of spam, improve spam sorting precision, avoided the erroneous judgement of spam.In certain embodiments, also utilize the feedback adjusting after the mail treatment, make the continuous self-optimization of this rubbish intercepting system, avoid system to solidify the new variation that is not suitable with spam.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified, or some of their technical features replaced by equivalents, and that such changes, provided they do not depart from the spirit of the technical solution of the present invention, shall all fall within the scope of the technical solution claimed by the present invention.

Claims (15)

1. A spam filtering method, comprising:
when an email is received, scanning the content of the email for fuzzy words that hit entries in a preset fuzzy word and context recognition library;
if a hit entry exists, performing context analysis on the email, and adjusting according to the context corresponding to the email to obtain a mail value vector of the email;
calculating a spam probability from the adjusted mail value vector of the email, comparing the spam probability with a preset threshold to judge whether the email is suspicious spam, and intercepting any email determined to be suspicious spam.
2. The spam filtering method according to claim 1, further comprising, before the email is received, an operation of building the fuzzy word and context recognition library, which specifically comprises:
creating the fuzzy word and context recognition library;
adding entries to the fuzzy word and context recognition library according to correspondences between existing spam keywords and fuzzy words;
calculating, from historical data in the anti-spam system, a reference mail value vector for each correspondence between an existing spam keyword and a fuzzy word, the reference mail value vector comprising a reference keyword score, a reference substitution score, a reference context score, and a reference interference-symbol score;
adding to each entry the influence probabilities that multiple contexts have on the correspondence between the existing spam keyword and the fuzzy word, each context comprising at least one context element.
3. The spam filtering method according to claim 2, wherein the operation of performing context analysis on the email specifically comprises:
analyzing the email to obtain its context elements;
matching the obtained context elements of the email against the context elements included in each context of the hit entry, and determining the context corresponding to the email.
4. The spam filtering method according to claim 3, wherein the operation of adjusting according to the context corresponding to the email to obtain the mail value vector of the email specifically comprises:
querying the influence probability corresponding to the context of the email determined by the context analysis;
adjusting the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the email, the mail value vector of the email comprising a keyword score, a substitution score, a context score, and an interference-symbol score.
5. The spam filtering method according to claim 4, wherein the operation of calculating the spam probability from the adjusted mail value vector of the email is specifically:
adding the product of the keyword score and the substitution score in the mail value vector of the email to the context score and the interference-symbol score, to obtain the spam probability corresponding to the email.
6. The spam filtering method according to any one of claims 1 to 5, further comprising, before scanning the content of the email for fuzzy words that hit entries in the preset fuzzy word and context recognition library:
performing interference-symbol removal on the non-language parts of the email.
7. The spam filtering method according to claim 6, wherein, after the email is judged to be suspicious spam and intercepted, the reference mail value vector corresponding to the correspondence between the existing spam keyword and the fuzzy word, and the influence probabilities corresponding to the contexts, are recalculated from this judgment result together with historical data, and the corresponding entries in the fuzzy word and context recognition library are updated.
8. The spam filtering method according to claim 1, wherein, if there are multiple hit entries, fuzzy word processing, spam probability calculation, and threshold comparison are performed for each hit entry separately, and the conclusions obtained for the individual hit entries are combined to judge whether the email is suspicious spam.
9. A spam filtering device, comprising:
an email receiving unit, configured to receive an email;
a fuzzy word scanning unit, configured to scan the content of the email for fuzzy words that hit entries in a preset fuzzy word and context recognition library;
a context analysis unit, configured to perform context analysis on the email when a hit entry exists;
a vector adjustment unit, configured to adjust according to the context corresponding to the email to obtain a mail value vector of the email;
a probability calculation unit, configured to calculate a spam probability from the adjusted mail value vector of the email;
a threshold comparison unit, configured to compare the spam probability with a preset threshold to judge whether the email is suspicious spam;
a mail processing unit, configured to intercept any email determined to be suspicious spam.
10. The spam filtering device according to claim 9, further comprising:
a library building unit, configured to create the fuzzy word and context recognition library;
an entry adding unit, configured to add entries to the fuzzy word and context recognition library according to correspondences between existing spam keywords and fuzzy words;
a reference vector calculation unit, configured to calculate, from historical data of the anti-spam system, a reference mail value vector for each correspondence between an existing spam keyword and a fuzzy word, the reference mail value vector comprising a reference keyword score, a reference substitution score, a reference context score, and a reference interference-symbol score;
a context probability adding unit, configured to add to each entry the influence probabilities that multiple contexts have on the correspondence between the existing spam keyword and the fuzzy word, each context comprising at least one context element.
11. The spam filtering device according to claim 10, wherein the context analysis unit further comprises:
a context element analysis component, configured to analyze the email to obtain its context elements;
a context element matching component, configured to match the obtained context elements of the email against the context elements included in each context of the hit entry;
a context determination component, configured to determine the context corresponding to the email according to the matching result.
12. The spam filtering device according to claim 11, wherein the vector adjustment unit further comprises:
an influence probability query component, configured to query the influence probability corresponding to the context of the email determined by the context analysis;
a vector adjustment component, configured to adjust the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the email, the mail value vector of the email comprising a keyword score, a substitution score, a context score, and an interference-symbol score.
13. The spam filtering device according to any one of claims 9 to 12, further comprising:
an interference-symbol removal unit, connected to the email receiving unit and the fuzzy word scanning unit, and configured to perform interference-symbol removal on the non-language parts of the email before the content of the email is scanned for fuzzy words that hit entries in the preset fuzzy word and context recognition library.
14. The spam filtering device according to claim 13, further comprising:
a library update unit, connected to the mail processing unit, and configured to, after the email is judged to be suspicious spam and intercepted, recalculate from this judgment result together with historical data the reference mail value vector corresponding to the correspondence between the existing spam keyword and the fuzzy word, and the influence probabilities corresponding to the contexts, and to update the corresponding entries in the fuzzy word and context recognition library.
15. The spam filtering device according to claim 9, wherein, if there are multiple hit entries, each hit entry is processed separately by the context analysis unit, the vector adjustment unit, the spam probability calculation unit, and the threshold comparison unit, and the conclusions obtained for the individual hit entries are combined to judge whether the email is suspicious spam.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110264365.1A CN103001848B (en) 2011-09-08 2011-09-08 Spam filtering method and spam filtering device

Publications (2)

Publication Number Publication Date
CN103001848A true CN103001848A (en) 2013-03-27
CN103001848B CN103001848B (en) 2015-10-21

Family

ID=47930004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110264365.1A Active CN103001848B (en) 2011-09-08 2011-09-08 Spam filtering method and spam filtering device

Country Status (1)

Country Link
CN (1) CN103001848B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200516484A (en) * 2003-10-27 2005-05-16 Softnext Technologies Co Ltd Filtering method for SPAM
US20060047760A1 (en) * 2004-08-27 2006-03-02 Susan Encinas Apparatus and method to identify SPAM emails
CN101106539A (en) * 2007-08-03 2008-01-16 浙江大学 Filtering method for spam based on supporting vector machine
CN101137087A (en) * 2007-08-01 2008-03-05 浙江大学 Short message monitoring center and monitoring method
CN101257671A (en) * 2007-07-06 2008-09-03 浙江大学 Method for real time filtering large scale rubbish SMS based on content
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway
CN101588558A (en) * 2009-03-30 2009-11-25 网易(杭州)网络有限公司 Spam filtering method and system
US7680890B1 (en) * 2004-06-22 2010-03-16 Wei Lin Fuzzy logic voting method and system for classifying e-mail using inputs from multiple spam classifiers
US20100106677A1 (en) * 2004-03-09 2010-04-29 Gozoom.Com, Inc. Email analysis using fuzzy matching of text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
罗倩等: "反垃圾邮件技术综述", 《渤海大学学报(自然科学版)》 *
胡锡衡: "垃圾邮件过滤系统模型的研究与设计", 《鞍山师范学院学报》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716335A (en) * 2014-01-12 2014-04-09 绵阳师范学院 Detecting and filtering method of spam mail based on counterfeit sender
CN103944810A (en) * 2014-05-06 2014-07-23 厦门大学 Spam e-mail intention recognition system
CN103944810B (en) * 2014-05-06 2017-02-15 厦门大学 Spam e-mail intention recognition system
CN111563721A (en) * 2020-04-21 2020-08-21 上海爱数信息技术股份有限公司 Mail classification method suitable for different label distribution occasions
CN111563721B (en) * 2020-04-21 2023-07-11 上海爱数信息技术股份有限公司 Mail classification method suitable for different label distribution occasions

Also Published As

Publication number Publication date
CN103001848B (en) 2015-10-21

Similar Documents

Publication Publication Date Title
US20210344632A1 (en) Detection of spam messages
CN101166159B (en) A method and system for identifying rubbish information
CN107423883B (en) Risk identification method and device for to-be-processed service and electronic equipment
AU2012367398B2 (en) Systems and methods for spam detection using character histograms
CN102024045B (en) Information classification processing method, device and terminal
JP5941163B2 (en) Spam detection system and method using frequency spectrum of character string
Lekha et al. Data mining techniques in detecting and predicting cyber crimes in banking sector
CN111275546A (en) Financial client fraud risk identification method and device
JP7049087B2 (en) Technology to detect suspicious electronic messages
RU2763921C1 (en) System and method for creating heuristic rules for detecting fraudulent emails attributed to the category of bec attacks
CN111314063A (en) Big data information management method, system and device based on Internet of things
US11929969B2 (en) System and method for identifying spam email
CN103001848A (en) Spam filtering method and spam filtering device
Gupta Spam mail filtering using data mining approach: A comparative performance analysis
WO2019242441A1 (en) Dynamic feature-based malware recognition method and system and related apparatus
CN111258796A (en) Service infrastructure and method of predicting and detecting potential anomalies therein
Reddy et al. Classification of Spam Messages using Random Forest Algorithm
CN110972086A (en) Short message processing method and device, electronic equipment and computer readable storage medium
CN115842677A (en) Self-adaptive mail security detection method and device
Vejendla et al. Score based Support Vector Machine for Spam Mail Detection
Manek et al. RePID-OK: spam detection using repetitive pre-processing
Parmar et al. Utilising machine learning against email phishing to detect malicious emails
WO2020215123A1 (en) Mitigation of phishing risk
CN113765852B (en) Data packet detection method, system, storage medium and computing device
CN110138723A (en) The determination method and system of malice community in a kind of mail network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant