CN102377690A - Anti-spam gateway system and method - Google Patents

Anti-spam gateway system and method

Info

Publication number
CN102377690A
Authority
CN
China
Prior art keywords
mail
sample
module
features
similitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103044703A
Other languages
Chinese (zh)
Other versions
CN102377690B (en)
Inventor
蔡瑞初
向东
熊卫华
洪陆驾
谭景峰
乔斌
潘雷明
周达和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201110304470.3A
Publication of CN102377690A
Application granted
Publication of CN102377690B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an anti-spam gateway system and an anti-spam method. The system comprises a mail sample database for storing various mail samples, and a mail feature mining module that obtains mail samples from the mail sample database and compares each sample with all center points; if the similarity measure between the sample and a center point is less than a certain threshold, the sample is added directly to that center point, where each center point represents one class of samples. To compute the similarity between a mail sample and a center point, both are parsed into several parts, the two are compared part by part, and the per-part similarities are combined by weighting into an overall similarity. With the system and the method, the sample database and the feature database adapt well to suddenly emerging spam types and the like, so that the spam miss rate is low, real-time performance is good, little manual intervention is needed, and the system scales well.

Description

Anti-spam gateway system and method
Technical field
The present invention relates to the field of email processing, and in particular to an anti-spam gateway system and method based on content clustering of bulk mail.
Background art
Spam is generally defined as email with any of the following properties: (1) email such as advertisements, electronic publications, and promotional material of any kind that the recipient has not requested or agreed to receive in advance; (2) email that the recipient cannot refuse; (3) email that hides the sender's identity, address, subject, or similar information; (4) email containing false information about its source, sender, or route.
Since the first spam message appeared, spam has been a persistent problem for mail users, and handling it has become an important factor for mail operators in improving user experience and attracting users. The task of anti-spam is to block spam outside the mail system or outside the user's inbox. Mainstream anti-spam techniques are based mainly on mail content and on mail sending behavior.
Existing content-based anti-spam techniques mainly include: the open-source system Dspam (available from http://www.nuclearelephant.com); the patent application No. 200810227762 of Tencent Technology (Shenzhen) Co., Ltd., entitled "Method and apparatus for intercepting spam"; the patent application No. 200810059602 of Zhejiang University, entitled "Chinese spam filtering method based on logistic regression"; and the patent application No. 200810115584 of Peking University, entitled "A spam detection method".
The above anti-spam techniques mainly comprise two flows, training and online use. The main steps of each flow are introduced below using Dspam as an example; the other techniques are broadly similar. The Dspam training flow comprises the following steps: 1. obtain a large number of mail samples and manually label them as spam or normal mail; 2. decode the mail; 3. segment the message body into words; 4. count the frequency of each word; 5. train a naive Bayes classification model using Bayes' formula. Once the Dspam model is trained, the online flow is relatively simple and comprises only two steps: 1. segment the incoming mail into words; 2. classify the mail with the trained naive Bayes model.
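The training and online flows described above can be sketched as follows. This is a minimal illustration of word-count naive Bayes with Laplace smoothing, not Dspam's actual implementation; the function names `train` and `classify` are hypothetical, and the mail is assumed to be already decoded and segmented into tokens.

```python
import math
from collections import Counter

def train(samples):
    """samples: iterable of (tokens, label), label in {"spam", "ham"}.
    Assumes at least one sample of each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for tokens, label in samples:
        counts[label].update(tokens)   # step 4: word frequencies per class
        docs[label] += 1
    return counts, docs

def classify(tokens, counts, docs):
    """Return the label with the higher log-posterior (Laplace smoothing)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))   # class prior
        for tok in tokens:
            score += math.log((counts[label][tok] + 1) / (total + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["invoice", "cheap"], "spam"),
    (["invoice", "offer"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["lunch", "meeting"], "ham"),
]
counts, docs = train(training)
```

Working in log space avoids numeric underflow when a mail contains many tokens.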
Anti-spam strategies based on real-time sending behavior differ considerably from content-based strategies. Anti-spam systems based on real-time behavior generally have no training step. Typical strategies based on sending behavior include Checksum (available from http://www.rhyolite.com/dcc/) and the patent application No. 200810064806 of Harbin Engineering University, entitled "A method for judging spam based on topological behavior". The basic flow is introduced below using Checksum as an example. The basic assumption of Checksum is that a mail repeated many times is spam. Its flow is roughly as follows: 1. compute a fingerprint for each mail; 2. count the fingerprints of all mail entering the system; 3. judge mail whose fingerprint repeats many times to be spam directly.
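The three-step flow above can be illustrated with a short sketch. It is an assumption-laden stand-in: a SHA-256 hash of the normalized body replaces whatever fingerprint the real system computes (an exact hash only catches identical bodies), and the names `fingerprint`, `RepeatCounter`, and `limit` are hypothetical.

```python
import hashlib
from collections import Counter

def fingerprint(body: str) -> str:
    """Hash of the lower-cased, whitespace-normalized body (exact match only)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class RepeatCounter:
    def __init__(self, limit: int):
        self.limit = limit      # hypothetical repetition limit
        self.seen = Counter()   # fingerprint -> number of mails seen

    def is_bulk(self, body: str) -> bool:
        """Count this mail, then report whether its fingerprint exceeds the limit."""
        fp = fingerprint(body)
        self.seen[fp] += 1
        return self.seen[fp] > self.limit

rc = RepeatCounter(limit=2)
results = [rc.is_bulk("Buy now"), rc.is_bulk("buy   NOW"), rc.is_bulk("BUY now")]
```

Here the third copy of the same normalized body is the first to exceed the limit of 2.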
Combining mail content with real-time sending behavior is the mainstream approach in current commercial anti-spam systems. An effective way to combine the two is to convert the characteristics of mail content and real-time sending behavior into rules, accumulate a score for each rule matched, and decide whether the mail is spam by comparing the total score with a threshold. Representative techniques include: the open-source system SpamAssassin (available from http://spamassassin.apache.org/); the patent application No. 200710029369 of South China University of Technology, entitled "Anti-spam misfiltering method and system based on integrated decision-making"; the commercial Brightmail system of Symantec Corporation (available from http://www.symantec.com/business/products/family.jsp?Familyid=brightmail); and the KBAS system of Hanqi Technology (available from http://www.hanqinet.com/projectl.html). The main flow is introduced below using SpamAssassin as a representative. SpamAssassin comprises two flows, training and online use. Training for rule-based anti-spam techniques mainly comprises the following steps: 1. obtain a large number of mail samples and manually label them as spam or normal mail; 2. manually add rules to build a rule base; 3. score the rules using the manually labelled samples. Online use comprises the following two steps: 1. compute the rules matched by each mail; 2. sum the scores of all matched rules and decide by threshold whether the mail is spam.
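The online part of a rule-based scorer can be sketched as below. The rules, scores, and threshold are invented for illustration and are not taken from SpamAssassin's rule set; `RULES`, `score`, and `is_spam` are hypothetical names.

```python
import re

# Hypothetical rules: (name, pattern, score). Real rule bases are far larger
# and their scores are fitted on labelled samples.
RULES = [
    ("SUBJECT_INVOICE", re.compile(r"invoice", re.I), 2.5),
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.5),
    ("CONTACT_NUMBER", re.compile(r"\b\d{8,}\b"), 1.0),
]

def score(mail_text: str):
    """Return (total score, names of matched rules)."""
    hits = [(name, pts) for name, pat, pts in RULES if pat.search(mail_text)]
    return sum(pts for _, pts in hits), [name for name, _ in hits]

def is_spam(mail_text: str, threshold: float = 3.0) -> bool:
    total, _ = score(mail_text)
    return total >= threshold

total, matched = score("Cheap invoice!!! call 12345678")
```

Returning the matched rule names alongside the total makes a verdict easy to audit, which matters when an administrator has to review false positives.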
Existing anti-spam systems have the following main deficiencies.
A) They lack an effective feedback collection mechanism, so feedback information cannot be used effectively. Although most mail systems have feedback mechanisms such as spam reporting, the feedback from channels such as user feedback, honeypot mailboxes, and administrator review is relatively independent and scattered, and there is no effective mechanism for collecting, integrating, and using it. A honeypot mailbox is a special email account; all mail that enters it is spam.
B) They lack an automatic learning mechanism, cannot respond in time to sudden outbreaks of spam, and are easily defeated by spammers. Existing anti-spam systems all judge newly arrived mail using parameters learned or set in advance. This approach cannot effectively handle new spam types that break out suddenly. Moreover, because the model in a conventional anti-spam system is relatively fixed, spammers can easily discover the characteristics of the system, so that after a while the system is defeated and becomes ineffective.
C) The miss rate and the false-positive rate are high. Existing anti-spam systems cannot adapt to rapid changes in mail types, and some anti-spam strategies developed abroad do not take conditions specific to China into account, which leads to a high miss rate. At the same time, because existing systems lack an effective false-positive feedback mechanism, false positives cannot be corrected effectively, and the false-positive rate is too high.
D) The manual review workload is large. Two stages of existing systems require considerable manual review. First, mail for which the system cannot reach a verdict must be reviewed manually, and the volume of such mail is large. Second, to adapt the system to new spam types, samples must be prepared for retraining; not only is the number of samples to review large, but the requirements on the sample distribution are also high, which makes the task difficult.
Summary of the invention
To solve the above technical problems, the present invention proposes an anti-spam gateway system and method.
The anti-spam gateway system of the present invention comprises: a mail sample database for storing various mail samples; and a mail feature mining module for obtaining a mail sample from the mail sample database and comparing the sample with all center points, wherein if the similarity measure is less than a certain threshold the sample is added directly to that center point, each center point being the representative of one class of samples. When the similarity between a mail sample and a center point is computed, the mail sample and the center point are each parsed into several parts, the two are compared part by part, and the per-part similarities are combined by weighting to obtain the overall similarity between the mail sample and the center point.
In addition, the present invention proposes an anti-spam method comprising: storing various mail samples in a mail sample database; obtaining a mail sample from the mail sample database; and comparing the sample with all center points, wherein if the similarity measure is less than a certain threshold the sample is added directly to that center point, each center point being the representative of one class of samples. When the similarity between a mail sample and a center point is computed, the mail sample and the center point are each parsed into several parts, the two are compared part by part, and the per-part similarities are combined by weighting to obtain the overall similarity between the mail sample and the center point.
The anti-spam gateway system and method of the present invention have the following advantages.
1) Good adaptability to suddenly emerging spam types and the like. The feedback collection mechanism proposed by the present invention gathers mail from honeypot mailboxes, user reports, and administrator review in a unified and timely manner, so that the latest developments in online spam can be obtained in real time. Through the online learning module for mail features, the latest features of online mail can be obtained in time, so that the system can adapt to rapid changes in spam types.
2) Low spam miss rate and good real-time performance. The invention provides anti-spam modules at two levels, namely the online mail classification module and the offline mail classification module. The online classifier gives up part of the detection rate in exchange for real-time responsiveness. The offline classifier makes up for the shortcomings of the online classifier, achieving a higher detection rate at the cost of a longer delay, and so acts as a remedy after the fact. Used together, the online and offline classifiers give the anti-spam gateway of the present invention a low miss rate and good real-time performance.
3) Little manual intervention. Through the feedback collection mechanism and the mail feature mining algorithm, the present invention extracts mail features automatically and effectively, and samples do not need to be reviewed manually. The administrator only needs to review some of the mined mail features, and the volume of these is very small. The manual review workload of the system and method of the present invention is therefore very small.
4) Good scalability. By modifying the mail distribution server settings to dynamically increase or decrease the number of classification module servers, the system can be adapted to anti-spam deployments of many scales.
Description of drawings
Fig. 1 is an architecture diagram of the anti-spam gateway system of the present invention;
Fig. 2 is a flowchart of the anti-spam method of the present invention;
Fig. 3 is a schematic diagram of the feedback acquisition step in the anti-spam method of the present invention;
Fig. 4 is a schematic diagram of the mail feature mining step in the anti-spam method of the present invention;
Fig. 5 is a schematic diagram of the mail classification step in the anti-spam method of the present invention.
Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 shows the architecture of the anti-spam gateway system of the present invention, which is based on content clustering of bulk mail.
With reference to Fig. 1, the gateway system of the present invention comprises a mail system interface, a mail distribution module, an online mail classification module, an offline mail classification module, a mail sample collection module, a mail feature mining module, a system management module, an administrator interface, a database interface, a mail sample database, and a mail feature database.
The mail system interface implements all communication between the anti-spam gateway and the mail system. This includes obtaining online mail in real time from the mail transfer agent and delivering it to the mail distribution module; returning the classification result of the online mail classification module to the mail transfer agent; returning the spam list of the offline mail classification module to the mail transfer agent; and establishing bulk mail export connections to obtain mail such as user reports and honeypot mail from the mail server.
The mail distribution module distributes requests entering the gateway system to the appropriate module according to their type. Online mail requests are forwarded to the online and offline mail classifiers, and feedback mail requests, such as those from user reports, honeypots, and administrators, are passed to the mail sample collection module. The mail distribution module is also responsible for load balancing across the mail classification modules and the mail sample collection modules.
The online mail classification module responds to a request sent by the mail distribution module, establishes a connection with the mail distribution module, and obtains the mail content. It then classifies the online mail according to the existing normal-mail and spam features and returns the verdict, spam or not, to the mail transfer agent in real time over the original connection, that is, the mail transmission connection established when the request from the mail distribution module was answered. The online mail classification module also connects to the mail feature database through the database interface and fetches the latest mail features from it at regular intervals. The mail features in the mail feature database are updated in real time; the latest mail features are those resulting from the most recent update.
The offline mail classification module connects to the mail feature database through the database interface and fetches the latest mail features from it at regular intervals. It then uses the newly extracted mail features to classify the mail cached over the preceding period, and returns the classification result to the mail transfer agent in the form of a list of mails that need to be moved.
The mail sample collection module responds to a request sent by the mail distribution module, establishes a connection, and obtains the type and content of the mail sample. It collects mail samples on the principle of keeping the proportions of the various mail types in the mail sample database balanced. The types of mail sample collected include spam reported by users, normal mail reported by users, mail from honeypots, administrator review results, and so on.
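The source does not say how the balance between mail types is enforced. One possible reading, sketched here purely as an assumption, is to refuse a sample when accepting it would push its type above a fixed share of the store. The class name `SampleCollector` and the parameters `max_share` and `warmup` are hypothetical.

```python
from collections import Counter

class SampleCollector:
    """Accepts a sample only while its type stays under max_share of the store."""

    def __init__(self, max_share=0.5, warmup=4):
        self.max_share = max_share   # hypothetical cap on any one type's share
        self.warmup = warmup         # accept everything until this many samples
        self.counts = Counter()
        self.samples = []

    def offer(self, mail, kind):
        total = len(self.samples)
        share_if_added = (self.counts[kind] + 1) / (total + 1)
        if total >= self.warmup and share_if_added > self.max_share:
            return False             # this type is over-represented; skip it
        self.samples.append((kind, mail))
        self.counts[kind] += 1
        return True

collector = SampleCollector(max_share=0.5, warmup=4)
first_four = [collector.offer(f"m{i}", "user_spam") for i in range(4)]
```

After the warm-up, a fifth user-reported spam would be refused until samples of other types, such as honeypot mail, have been added.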
The mail feature mining module is invoked by the system management module. It obtains mail samples from the mail sample database and mines them for the features of spam and of normal mail. The module first connects to the mail sample database through the database interface and obtains the feedback samples. It then analyzes these samples, and the mined mail features enter the mail feature database after review by the system administrator.
The mail feature mining module uses a clustering algorithm to extract the various types of mail features from the feedback samples. Specifically, from the feedback mail samples it extracts the mails whose report count reaches a certain threshold, and discards feedback that results from interference or from individual user preference. For example, if a class of spam on the subject of invoices is found and has been reported more than a threshold number of times (for example 100), mail of that class is judged to be spam and its features are added to the spam feature database. If, on the other hand, some users report a mail such as a newsletter as spam while other users consider it normal mail, that mail cannot be used as a spam sample.
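The filtering of noisy feedback described above can be sketched as a report-count check combined with a cap on conflicting "normal mail" reports. This is an illustrative reading, not the patent's listing; the function `select_spam_classes`, the class identifiers, and the default of 0.1 for the conflicting-report ratio are assumptions (the value 100 is the example given in the text).

```python
from collections import Counter

def select_spam_classes(reports, min_reports=100, max_ham_ratio=0.1):
    """reports: iterable of (class_id, label), label in {"spam", "ham"}.
    Returns the classes with enough reports and few conflicting ham reports."""
    spam, ham = Counter(), Counter()
    for class_id, label in reports:
        (spam if label == "spam" else ham)[class_id] += 1
    selected = []
    for class_id in spam:
        total = spam[class_id] + ham[class_id]
        if total > min_reports and ham[class_id] / total < max_ham_ratio:
            selected.append(class_id)
    return sorted(selected)

reports = (
    [("invoice", "spam")] * 150
    + [("newsletter", "spam")] * 80
    + [("newsletter", "ham")] * 70
    + [("rare", "spam")] * 5
)
selected = select_spam_classes(reports)
```

In this toy data the invoice class passes, the newsletter class is rejected because users disagree about it, and the rarely reported class is rejected for lack of evidence.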
The clustering algorithm adopted by the present invention is preferably an improved center-point clustering algorithm. Each center point is the representative of one class of samples and contains the following information: a mail subject template; for short text, a short-text template; for long text, the mean of the corresponding fingerprints; the mean fingerprint of the attachments; an IP set; and a sender set. A typical center point is as follows. The mail subject template is "issuing * invoices * on your behalf" (* is a wildcard). The short-text template is "our comp*any can issue * * * all kinds of VAT invoices on your behalf; those who * need them * contact QQ 92342*". The long-text fingerprint and the attachment fingerprint are the nilsimsa hash values of the corresponding content. The IP set is the list of sending IPs, such as "199.1.1.1". The sender set is the list of sending mailboxes, such as Asdf163.com. When a new mail sample arrives, it is compared with all existing center points; if the similarity measure is less than a certain threshold, the sample is added directly to that center point and the center point is updated. The mail center points obtained by clustering are the mail features. After clustering, if the number of samples in a class exceeds a threshold n and the proportion of samples reported as ham (normal mail) is less than a threshold t, the center point of that class is extracted as a spam sample. The improved center-point clustering algorithm can be implemented by the following procedure.
Figure BDA0000097386280000071
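The original listing is an image that is not reproduced here. The following is a minimal single-pass sketch of center-point clustering under stated assumptions, not a transcription of that listing: the comparison is expressed as a distance (so that "less than the threshold" means "close enough"), the center's representative is simply its first member rather than an updated template, and `cluster` and `jaccard_distance` are hypothetical names.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Stand-in distance on word sets; 0.0 means identical vocabularies."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def cluster(samples, distance, threshold):
    """Assign each sample to its nearest center, or open a new center."""
    centers = []
    for sample in samples:
        best, best_d = None, None
        for center in centers:
            d = distance(sample, center["rep"])
            if best_d is None or d < best_d:
                best, best_d = center, d
        if best is not None and best_d < threshold:
            best["members"].append(sample)    # join the nearest center
        else:
            centers.append({"rep": sample, "members": [sample]})
    return centers

mails = [
    "cheap invoice call now",
    "cheap invoice call today",
    "meeting at noon",
    "cheap invoice call now please",
]
centers = cluster(mails, jaccard_distance, threshold=0.5)
```

A fuller version would update the center after each addition, for example by re-deriving the wildcard template and averaging fingerprints as the text describes.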
In the above center-point clustering algorithm, the similarity between a mail sample and a center point is computed as follows. To compute the similarity between a mail sample and a given center point, the following steps are performed: parse the mail into its major parts, namely subject, sending IP, sender, body, and attachments; remove interference from the body and extract five blocks of content, namely the mail structural framework, Chinese text, English text, text in other languages, and body structure information; for enumerated variables such as IP, measure similarity directly by whether the two sets intersect; for long text and attachments, compute similarity from fingerprints; for short text, determine similarity with the Needleman-Wunsch algorithm; finally, combine the similarities of the parts by weighting to obtain the overall similarity of the two mails.
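The final weighted combination can be sketched as follows. The weights are invented for illustration, since the source gives no values, and the part names and the functions `set_similarity` and `overall_similarity` are hypothetical. Each per-part score is assumed to lie in [0, 1] with higher meaning more similar.

```python
def set_similarity(a, b) -> float:
    """1.0 if the two collections share any element, else 0.0 (IPs, senders)."""
    return 1.0 if set(a) & set(b) else 0.0

# Hypothetical weights; the source does not specify them.
WEIGHTS = {"subject": 0.3, "body": 0.4, "attachment": 0.1, "ips": 0.1, "senders": 0.1}

def overall_similarity(part_scores, weights=WEIGHTS) -> float:
    """Weighted average over the parts that are present in part_scores."""
    used = {k: w for k, w in weights.items() if k in part_scores}
    norm = sum(used.values())
    if norm == 0:
        return 0.0
    return sum(part_scores[k] * w for k, w in used.items()) / norm

scores = {
    "subject": 1.0,
    "body": 0.5,
    "ips": set_similarity(["199.1.1.1"], ["199.1.1.1", "10.0.0.1"]),
}
combined = overall_similarity(scores)
```

Renormalizing over the parts that are present keeps the result comparable between mails with and without attachments. A threshold stated as "less than" can be applied to `1 - combined` if the measure is read as a distance.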
The similarity measures for the individual parts are as follows. 1) Enumerated variables such as IP and sender: the sending IPs of all mails in a mail center point form a set. When the similarity of two IP sets is measured, if their intersection is non-empty (that is, they share an IP), the similarity is defined as 1; otherwise it is 0. Enumerated variables such as sender are handled in the same way. 2) Short text: the Needleman-Wunsch algorithm is used to determine the optimal alignment of the two sequences. For the principle of the algorithm and pseudocode, see http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm. The algorithm requires match and mismatch scores for three types of characters, namely Chinese, English, and wildcard; these can be obtained from rough statistics on the data. After alignment, the common part of the two strings is their template, and the differing parts are represented by wildcards. 3) Long text: the denoised texts are compared using the nilsimsa fingerprint technique, which can be implemented with the open-source code at http://ixazon.dynip.com/~cmeclax/nilsimsa.html.
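The short-text step can be illustrated with a compact Needleman-Wunsch alignment that also derives the wildcard template. This sketch makes simplifying assumptions: a single match, mismatch, and gap score is used for all characters instead of separate scores for Chinese, English, and wildcard characters, and the function name `nw_template` is hypothetical.

```python
def nw_template(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (alignment score, wildcard template)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, collapsing differences into "*"
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and a[i - 1] == b[j - 1]
                and score[i][j] == score[i - 1][j - 1] + match):
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        else:
            if not out or out[-1] != "*":
                out.append("*")
            if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + mismatch:
                i, j = i - 1, j - 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                i -= 1
            else:
                j -= 1
    return score[n][m], "".join(reversed(out))

result = nw_template("invoice 123 now", "invoice 456 now")
```

For the two strings above, the twelve shared characters are kept and the differing digits collapse into a single wildcard. When several alignments tie, the template depends on the traceback order, so templates for less regular inputs are not unique.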
When a new mail arrives, the anti-spam gateway first uses the online mail classification module to compare it. If the spam queue contains a mail whose similarity measure with the new mail is less than a threshold t, the new mail is judged to be spam and the result is returned. The spam queue is a member of the online mail classification module, and its contents are obtained from the mail feature database. The specific algorithm is as follows:
Figure BDA0000097386280000081
Figure BDA0000097386280000091
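The original listings are images that are not reproduced here. As a hedged sketch of the behavior described in the text, the class below compares a mail against a spam queue and reloads the queue from a feature source at a fixed interval. The names `OnlineClassifier`, `load_features`, `refresh_seconds`, and `word_distance` are hypothetical, and a word-set distance stands in for the multi-part measure.

```python
import time

def word_distance(a: str, b: str) -> float:
    """Stand-in distance on word sets (0.0 means the same vocabulary)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

class OnlineClassifier:
    def __init__(self, load_features, distance, t,
                 refresh_seconds=60.0, clock=time.monotonic):
        self.load_features = load_features   # callable returning spam features
        self.distance = distance
        self.t = t
        self.refresh_seconds = refresh_seconds
        self.clock = clock
        self.spam_queue = list(load_features())
        self.loaded_at = clock()

    def classify(self, mail: str) -> str:
        # Reload the spam queue once the refresh interval has elapsed
        if self.clock() - self.loaded_at >= self.refresh_seconds:
            self.spam_queue = list(self.load_features())
            self.loaded_at = self.clock()
        hit = any(self.distance(mail, f) < self.t for f in self.spam_queue)
        return "spam" if hit else "ham"

features = ["cheap invoice call now"]
now = [0.0]                                   # fake clock for the example
clf = OnlineClassifier(lambda: list(features), word_distance, t=0.5,
                       refresh_seconds=60.0, clock=lambda: now[0])
```

Injecting the clock keeps the refresh logic testable without sleeping.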
When a new spam feature enters the spam feature database, the offline mail classification module compares the newly discovered spam feature with all mail in the cache queue. If a mail in the cache queue has a similarity measure with the new spam feature that is less than the threshold t, that mail is judged to be spam, removed from the mail queue, and the result is returned. The specific algorithm is as follows:
Figure BDA0000097386280000092
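The original listing is an image that is not reproduced here. A minimal sketch of the offline pass described in the text might look as follows; `rescan_buffer` and `word_distance` are hypothetical names, and the word-set distance again stands in for the multi-part measure.

```python
from collections import deque

def word_distance(a: str, b: str) -> float:
    """Stand-in distance on word sets (0.0 means the same vocabulary)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def rescan_buffer(buffer, new_feature, distance, t):
    """Remove and return the ids of buffered mails within t of a new spam feature."""
    caught, kept = [], deque()
    while buffer:
        mail_id, text = buffer.popleft()
        if distance(text, new_feature) < t:
            caught.append(mail_id)       # to be reported so the mail can be moved
        else:
            kept.append((mail_id, text))
    buffer.extend(kept)                  # unmatched mails stay cached, in order
    return caught

buffer = deque([
    (1, "cheap invoice call now"),
    (2, "meeting at noon"),
    (3, "cheap invoice call today"),
])
caught = rescan_buffer(buffer, "cheap invoice call now", word_distance, t=0.5)
```

The returned ids correspond to the list of mails that the offline module hands back to the mail transfer agent.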
The mail distribution server that hosts the mail distribution module is the master server. It keeps the current configuration of each server and the processing delay of each server. For each newly arrived mail, the master server polls the delays of the servers and distributes the new mail sample to the server with the smallest delay. Each slave server reports its latest processing delay to the distribution server after it finishes processing a mail.
Figure BDA0000097386280000093
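The original listing is an image that is not reproduced here. The least-delay dispatch described in the text can be sketched in a few lines; the class name `Dispatcher` and its methods are hypothetical, and the delay is assumed to be the last value each worker reported.

```python
class Dispatcher:
    def __init__(self, servers):
        # Last reported processing delay per worker, in seconds
        self.delay = {name: 0.0 for name in servers}

    def pick(self) -> str:
        """Choose the worker with the smallest last-reported delay."""
        return min(self.delay, key=self.delay.get)

    def report(self, server: str, seconds: float) -> None:
        """Called by a worker after it finishes processing a mail."""
        self.delay[server] = seconds

d = Dispatcher(["a", "b", "c"])
d.report("a", 0.3)
d.report("b", 0.1)
d.report("c", 0.2)
first_choice = d.pick()
```

Because only the last report is kept, one slow mail can steer traffic away from a worker for a while; a moving average would smooth this, at the cost of reacting more slowly.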
With continued reference to Fig. 1, the system management module handles the setting of the various algorithm parameters, the distribution of configuration files, and server performance monitoring and optimization.
The administrator interface lets the system administrator manually review and confirm the mail features mined by the system, review some suspicious mail, set the various parameters, and so on.
The database interface provides a unified interface and access control for database operations such as reading and updating mail samples and mail features.
The mail sample database stores the labelled mail obtained from user reports, administrator review, and honeypot mailboxes.
The mail feature database stores the mail features obtained by the mail feature mining module.
In summary, the anti-spam gateway of the present invention is composed of the mail system interface, the mail distribution module, the online and offline mail classification modules, the mail sample collection module, the mail feature mining module, the system management module, the administrator interface, the database interface, the mail sample database, the mail feature database, and so on. Together these modules perform three functions: mail classification, feedback collection, and mail feature mining. In mail classification, the gateway obtains information such as mail content and user behavior from the mail transfer agent through the mail system interface, classifies the mail with the online and offline mail classification modules, and returns the mail class to the mail transfer server. In feedback collection, mail samples such as user feedback, honeypot mail, and system administrator review results enter the gateway system through the mail distributor and the mail sample collection module and become learning samples. In mail feature mining, the gateway mines the latest spam features from the feedback samples through the mail feature mining module and distributes them to the online and offline mail classification models.
The anti-spam gateway of the present invention extracts spam and normal-mail features from feedback information. Feedback such as users reporting spam, reporting normal mail, and moving mail contains a great deal of useful information, but it also contains a great deal of noise. Removing the noise from the feedback and extracting spam and normal-mail features in time is the key to the gateway's ability to learn by itself.
The anti-spam gateway of the present invention uses a spam classification algorithm. Specifically, it classifies the mail assigned by the mail distributor using the existing normal-mail and spam features, with three goals: a low spam false-positive rate, a high spam detection rate, and a fast response.
The anti-spam gateway of the present invention uses a scheduling algorithm in the mail distributor to distribute large volumes of fast-arriving online mail to the processors in real time, implementing the processing logic for the various kinds of mail, load balancing across the servers, and distributed configuration of the various services.
Fig. 2 is the flow chart that the present invention is based on the anti-rubbish method of mass-mailer content clustering.Fig. 3 is the realization schematic diagram of feedback obtaining step.Fig. 4 is the realization schematic diagram of mail features excavation step.Fig. 5 is the realization schematic diagram of classification of mail step.
With reference to Fig. 2, the method comprises the following steps.
S201: Through the mail system interface, online mail is obtained in real time from the mail transport agent and handed to the mail distribution module; the classification results of the online mail classification module are returned to the mail transport agent, and the spam list produced by the offline mail classification module is returned to the mail transport agent. With further reference to Fig. 3, in this step normal mail and spam samples can be obtained from three sources, namely the system administrator, users, and honeypots, and these mails enter the mail sample database after passing through the mail distribution module and the mail sample collection module. The mail sample collection module collects samples on the principle of keeping the proportions of the various mail types in the mail sample database balanced.
S202: The mail distribution module forwards online mail requests to the online mail classifier and passes mail requests fed back through the various channels to the mail sample collection module.
S203: The online mail classification module classifies online mail according to the existing normal-mail and spam features, returns the identification result to the mail transport agent in real time, and obtains the latest mail features from the mail feature database at regular intervals. The classification process can be further understood with reference to Fig. 5. The mail transport agent passes mail information into the anti-spam gateway through the mail system interface; the mail system interface forwards the mail to the mail distribution module; the mail distribution module distributes the mail to the online mail classification module, the offline mail classification module, and the sample collection module according to the configured policy; the online mail classification module classifies the mail according to the information in the mail feature database and returns the result to the mail transport agent along the path "mail distribution module, mail system interface, mail transport agent"; the mail distribution module also forwards the mail to the sample collection module, which decides according to its own policy whether to add the mail to the sample database. The online and offline mail classification modules differ in that the online module returns the classification result in real time, whereas the offline module returns the classification result to the mail transport agent asynchronously.
S204: The offline mail classification module obtains the latest mail features from the mail feature database at regular intervals, uses the newly extracted features to classify the mail cached over the preceding period, and returns the classification results to the mail transport agent.
S205: The mail sample collection module responds to the request connections sent by the mail distribution module and obtains the type and content of each mail sample.
S206: The mail feature mining module obtains mail samples from the mail sample database and mines the features of spam and normal mail from them; the mined features enter the mail feature database after review by the system administrator. With further reference to Fig. 4, in this feature mining step the system first extracts the mail samples of a recent period from the mail sample database; the mail feature mining module then performs cluster analysis on the samples, and the mined features are added to the mail feature database after the system administrator's review. During cluster analysis, each mail sample is compared with all central points; if the similarity is less than a certain threshold, the sample is added directly to that central point, where each central point is the representative of one class of samples. When the similarity between a mail sample and a central point is calculated, both are parsed into several parts, the similarity of the two is compared part by part, and the per-part similarities are combined by weighting to obtain the global similarity between the mail sample and the central point. When the similarity is compared for each part, enumerated variables are measured by whether their sets have an intersection, long text and attachments are compared by fingerprint, and short text is compared with the Needleman-Wunsch algorithm. The mined mail features are manually reviewed and confirmed, some suspicious mails are reviewed, and the various parameters are set.
S207: The various mail samples are stored in the mail sample database.
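The part-by-part comparison and weighted combination described for step S206 can be illustrated with a minimal sketch. It is not the patented implementation: the part names (`sender_domains`, `urls`, `body`), the weights, and the fingerprint scheme (hashed three-word shingles compared by Jaccard overlap) are all assumptions chosen for illustration, since the text does not specify them. Short-text parts are left out here.

```python
import hashlib


def enum_similarity(a, b):
    """Enumerated variables: 1.0 if the two value sets intersect, else 0.0."""
    return 1.0 if set(a) & set(b) else 0.0


def fingerprint(text, k=3):
    """Assumed fingerprint: the set of hashed k-word shingles of the text."""
    words = text.lower().split()
    count = max(1, len(words) - k + 1)
    shingles = (" ".join(words[i:i + k]) for i in range(count))
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles}


def fingerprint_similarity(a, b):
    """Long text: Jaccard overlap of the two fingerprints (an assumption)."""
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / len(fa | fb)


# Hypothetical mapping from mail part to the measure used for that part.
PART_MEASURES = {
    "sender_domains": enum_similarity,
    "urls": enum_similarity,
    "body": fingerprint_similarity,
}


def global_similarity(sample, center, weights):
    """Weighted combination of the per-part similarities."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(
        weight * PART_MEASURES[part](sample[part], center[part])
        for part, weight in weights.items()
    )
    return combined / total


# Usage with made-up data: same sender domain, different URLs and body.
weights = {"sender_domains": 0.5, "urls": 0.25, "body": 0.25}
sample = {"sender_domains": {"promo.example"},
          "urls": {"http://x.example/buy"},
          "body": "buy cheap pills now"}
center = {"sender_domains": {"promo.example"},
          "urls": {"http://y.example"},
          "body": "meeting agenda for monday"}
print(global_similarity(sample, center, weights))
```

Normalizing by the sum of the weights keeps the global value in the range 0 to 1 regardless of how the weights are scaled, which makes a single threshold easier to set.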
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
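For the short-text comparison in step S206, the description names the Needleman-Wunsch algorithm without giving scoring parameters. The sketch below assumes a match score of 1 and mismatch and gap penalties of -1, and it assumes one possible normalization (alignment score divided by the longer length, clamped at 0) to turn the score into a similarity; both choices are illustrative only.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b.

    Only the previous row of the dynamic-programming table is kept,
    because the score (not the alignment itself) is all that is needed.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + step,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def short_text_similarity(a, b):
    """Assumed normalization: score / longer length, clamped to [0, 1].

    Valid for the default scoring above, where a perfect match of a
    string with itself scores exactly its length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, needleman_wunsch_score(a, b) / longest)


# Usage with made-up subject lines.
print(short_text_similarity("free offer today", "free 0ffer today"))
```

Global alignment suits short fields such as subject lines because small character substitutions, a common way to disguise spam, lower the score only slightly.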

Claims (10)

1. An anti-spam gateway system, comprising:
a mail system interface, configured to obtain online mail in real time from a mail transport agent and hand the mail to a mail distribution module, to return the classification results of an online mail classification module to the mail transport agent, and to return the spam list of an offline mail classification module to the mail transport agent;
the mail distribution module, configured to forward online mail requests to the online mail classifier and to pass mail requests fed back through various channels to a mail sample collection module;
the online mail classification module, configured to classify online mail according to existing normal-mail and spam features, to return the identification result to the mail transport agent in real time, and to obtain the latest mail features from a mail feature database at regular intervals;
the offline mail classification module, configured to obtain the latest mail features from the mail feature database at regular intervals, to classify the mail cached over the preceding period using the newly extracted mail features, and to return the classification results to the mail transport agent;
the mail sample collection module, configured to respond to the request connections sent by the mail distribution module and to obtain the type and content of each mail sample;
a mail feature mining module, configured to obtain mail samples from a mail sample database and to mine the features of spam and normal mail from them, the mined mail features entering the mail feature database after review by a system administrator;
the mail sample database, configured to store the various mail samples.
2. The anti-spam gateway system as claimed in claim 1, characterized in that the mail feature mining module is further configured to obtain a mail sample from the mail sample database and compare the mail sample with all central points; if the similarity is less than a certain threshold, the sample is added directly to that central point, wherein each central point is the representative of one class of samples; when the similarity between the mail sample and a central point is calculated, the mail sample and the central point are each parsed into several parts, the similarity of the two is compared part by part, and the per-part similarities are combined by weighting to obtain the global similarity between the mail sample and the central point.
3. The anti-spam gateway system as claimed in claim 2, characterized in that, when the similarity between the mail sample and the central point is compared for each part, the similarity of enumerated variables is measured by whether their sets have an intersection, the similarity of long text and attachments is calculated by fingerprint, and the similarity of short text is determined with the Needleman-Wunsch algorithm.
4. The anti-spam gateway system as claimed in claim 3, characterized in that, when a new mail sample enters, the mail sample is compared with all central points; if the similarity is less than a certain threshold, the sample is added directly to that central point; after clustering, if the number of samples in a class exceeds one threshold and the proportion of samples reported as normal mail is less than another threshold, the center of that class is extracted as a spam sample.
5. The anti-spam gateway system as claimed in claim 4, characterized in that the system further comprises:
an administrator interface, used by the system administrator to manually review and confirm the mail features mined by the gateway system, to review some suspicious mails, and to set the various parameters.
6. An anti-spam method, comprising the steps of:
obtaining online mail in real time from a mail transport agent through a mail system interface and handing the mail to a mail distribution module; returning the classification results of an online mail classification module to the mail transport agent, and returning the spam list of an offline mail classification module to the mail transport agent;
forwarding online mail requests to the online mail classifier through the mail distribution module, and passing mail requests fed back through various channels to a mail sample collection module;
using the online mail classification module to classify online mail according to existing normal-mail and spam features, to return the identification result to the mail transport agent in real time, and to obtain the latest mail features from a mail feature database at regular intervals;
using the offline mail classification module to obtain the latest mail features from the mail feature database at regular intervals, to classify the mail cached over the preceding period using the newly extracted mail features, and to return the classification results to the mail transport agent;
responding, through the mail sample collection module, to the request connections sent by the mail distribution module and obtaining the type and content of each mail sample;
obtaining mail samples from a mail sample database through a mail feature mining module and mining the features of spam and normal mail from them, the mined mail features entering the mail feature database after review by a system administrator;
storing the various mail samples in the mail sample database.
7. The method as claimed in claim 6, characterized in that the mail feature mining module is further used to obtain a mail sample from the mail sample database and compare the mail sample with all central points; if the similarity is less than a certain threshold, the sample is added directly to that central point, wherein each central point is the representative of one class of samples; when the similarity between the mail sample and a central point is calculated, the mail sample and the central point are each parsed into several parts, the similarity of the two is compared part by part, and the per-part similarities are combined by weighting to obtain the global similarity between the mail sample and the central point.
8. The method as claimed in claim 7, characterized in that, when the similarity between the mail sample and the central point is compared for each part, the similarity of enumerated variables is measured by whether their sets have an intersection, the similarity of long text and attachments is calculated by fingerprint, and the similarity of short text is determined with the Needleman-Wunsch algorithm.
9. The method as claimed in claim 8, characterized in that, when a new mail sample enters, the mail sample is compared with all central points; if the similarity is less than a certain threshold, the sample is added directly to that central point; after clustering, if the number of samples in a class exceeds one threshold and the proportion of samples reported as normal mail is less than another threshold, the center of that class is extracted as a spam sample.
10. The method as claimed in claim 9, characterized in that the method further comprises:
manually reviewing and confirming the mined mail features, reviewing some suspicious mails, and setting the various parameters.
CN201110304470.3A 2011-10-10 2011-10-10 Anti-spam gateway system and method Active CN102377690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110304470.3A CN102377690B (en) 2011-10-10 2011-10-10 Anti-spam gateway system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110304470.3A CN102377690B (en) 2011-10-10 2011-10-10 Anti-spam gateway system and method

Publications (2)

Publication Number Publication Date
CN102377690A true CN102377690A (en) 2012-03-14
CN102377690B CN102377690B (en) 2014-09-17

Family

ID=45795681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110304470.3A Active CN102377690B (en) 2011-10-10 2011-10-10 Anti-spam gateway system and method

Country Status (1)

Country Link
CN (1) CN102377690B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103441924A (en) * 2013-09-03 2013-12-11 盈世信息科技(北京)有限公司 Method and device for spam filtering based on short text
CN103744888A (en) * 2013-12-23 2014-04-23 新浪网技术(中国)有限公司 Method and system for anti-spam gateway to query database
CN103841006A (en) * 2014-02-25 2014-06-04 汉柏科技有限公司 Method and device for intercepting junk mails in cloud computing system
CN104796318A (en) * 2014-07-30 2015-07-22 北京中科同向信息技术有限公司 Behavior pattern identification technology
CN108197638A (en) * 2017-12-12 2018-06-22 阿里巴巴集团控股有限公司 The method and device classified to sample to be assessed
CN108737255A (en) * 2018-05-31 2018-11-02 北京明朝万达科技股份有限公司 Load-balancing method, load balancing apparatus and server
CN112579733A (en) * 2019-09-30 2021-03-30 华为技术有限公司 Rule matching method, rule matching device, storage medium and electronic equipment

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1696943A (en) * 2004-05-13 2005-11-16 上海极软软件技术有限公司 Self-adaptive method for filtering out garbage E-mails safely
CN1716293A (en) * 2004-06-29 2006-01-04 微软公司 Incremental anti-spam lookup and update service
GB2425855A (en) * 2005-04-25 2006-11-08 Messagelabs Ltd Detecting and filtering of spam emails
CN101083630A (en) * 2006-06-01 2007-12-05 珠海金山软件股份有限公司 Anti-rubbish E-mail system and method
CN101094197A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and mail server of anti garbage mail
CN101119341A (en) * 2007-09-20 2008-02-06 腾讯科技(深圳)有限公司 Mail identifying method and apparatus
CN101136874A (en) * 2007-07-25 2008-03-05 华南理工大学 Compound decision based anti-rubbish E-mail error filtering method and system
CN101227435A (en) * 2008-01-28 2008-07-23 浙江大学 Method for filtering Chinese junk mail based on Logistic regression
CN101295381A (en) * 2008-06-25 2008-10-29 北京大学 Junk mail detecting method
CN101299729A (en) * 2008-06-25 2008-11-05 哈尔滨工程大学 Method for judging rubbish mail based on topological action
CN101415159A (en) * 2008-12-02 2009-04-22 腾讯科技(深圳)有限公司 Method and apparatus for intercepting junk mail
CN101588558A (en) * 2009-03-30 2009-11-25 网易(杭州)网络有限公司 Spam filtering method and system
CN102075447A (en) * 2009-11-25 2011-05-25 中兴通讯股份有限公司 Method and system for anti-spam mails

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103441924A (en) * 2013-09-03 2013-12-11 盈世信息科技(北京)有限公司 Method and device for spam filtering based on short text
CN103441924B (en) * 2013-09-03 2016-06-08 盈世信息科技(北京)有限公司 A kind of rubbish mail filtering method based on short text and device
CN103744888A (en) * 2013-12-23 2014-04-23 新浪网技术(中国)有限公司 Method and system for anti-spam gateway to query database
CN103841006A (en) * 2014-02-25 2014-06-04 汉柏科技有限公司 Method and device for intercepting junk mails in cloud computing system
CN104796318A (en) * 2014-07-30 2015-07-22 北京中科同向信息技术有限公司 Behavior pattern identification technology
CN108197638A (en) * 2017-12-12 2018-06-22 阿里巴巴集团控股有限公司 The method and device classified to sample to be assessed
CN108197638B (en) * 2017-12-12 2020-03-20 阿里巴巴集团控股有限公司 Method and device for classifying sample to be evaluated
CN108737255A (en) * 2018-05-31 2018-11-02 北京明朝万达科技股份有限公司 Load-balancing method, load balancing apparatus and server
CN108737255B (en) * 2018-05-31 2020-07-10 北京明朝万达科技股份有限公司 Load balancing method, load balancing device and server
CN112579733A (en) * 2019-09-30 2021-03-30 华为技术有限公司 Rule matching method, rule matching device, storage medium and electronic equipment
WO2021063089A1 (en) * 2019-09-30 2021-04-08 华为技术有限公司 Rule matching method, rule matching apparatus, storage medium and electronic device
CN112579733B (en) * 2019-09-30 2023-10-20 华为技术有限公司 Rule matching method, rule matching device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN102377690B (en) 2014-09-17

Similar Documents

Publication Publication Date Title
CN102377690B (en) Anti-spam gateway system and method
US7930353B2 (en) Trees of classifiers for detecting email spam
CN101674264B (en) Spam detection device and method based on user relationship mining and credit evaluation
CN112567407B (en) Privacy preserving tagging and classification of emails
US6928465B2 (en) Redundant email address detection and capture system
CN102413076A (en) Spam mail judging system based on behavior analysis
US7660865B2 (en) Spam filtering with probabilistic secure hashes
KR101117866B1 (en) Intelligent quarantining for spam prevention
Katirai et al. Filtering junk e-mail
CN101637002A (en) A method and system for collecting addresses for remotely accessible information sources
CN104067567A (en) Systems and methods for spam detection using character histograms
CN104040963A (en) System and methods for spam detection using frequency spectra of character strings
CN101330476A (en) Method for dynamically detecting junk mail
CN103379020A (en) Method and system for massively sending emails
CN102124485B (en) Apparatus, and associated method, for detecting fraudulent text message
Bhat et al. Classification of email using BeaKS: Behavior and keyword stemming
Das et al. Analysis of an image spam in email based on content analysis
Dada et al. Random forests machine learning technique for email spam filtering
CN103595614A (en) User feedback based junk mail detection method
Anitha et al. Email spam filtering using machine learning based xgboost classifier method
US10163005B2 (en) Document structure analysis device with image processing
Mishra et al. Analysis of random forest and Naive Bayes for spam mail using feature selection categorization
Johansen et al. Email Communities of Interest.
Gonzalez-Talavan A simple, configurable SMTP anti-spam filter: Greylists
CN106713108B (en) A kind of process for sorting mailings of combination customer relationship and bayesian theory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant