CN100583840C - Spam mail identify method based on interest cognition and system thereof - Google Patents

Info

Publication number
CN100583840C
Authority
CN
China
Prior art keywords
mail
spam
client
attribute
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200610124174A
Other languages
Chinese (zh)
Other versions
CN1976323A (en)
Inventor
皮佑国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN200610124174A
Publication of CN1976323A
Application granted
Publication of CN100583840C
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

A method for identifying spam based on interest cognition comprises building and maintaining a knowledge base of personal interest cognition, calculating probabilities and evaluating mail attributes, and outputting the results of the calculation and evaluation. An identification system implementing the method is also disclosed.

Description

A spam identification method based on interest cognition, and a system thereof
Technical field
The invention belongs to the field of computer information processing. Specifically, it is a method, and a system implementing it, for identifying and filtering spam on a personal computer and handling it accordingly.
Background art
Spam has grown rapidly in recent years. It consumes network resources, disrupts the normal operation of mail systems, and wastes mail users' resources and time. Current spam filtering techniques fall into three main groups: behavior filtering (whitelist and blacklist filtering), rule-based filtering, and content filtering. Whitelist and blacklist filtering is simple, but the lists must be updated in real time, and there is no guarantee that spam comes from only a limited, fixed set of senders. Rule-based filtering defines a set of rules and treats a mail as spam if it matches one or more of them; typical rules cover header analysis, bulk-mail detection, keyword matching, and other features of the mail content. Rule-based filtering can block spam well for a period of time, but the rules are all specified manually: people must keep discovering, summarizing, and updating them, so much depends on human judgment, and inexperienced users may be unable to supply effective rules. Writing rules by hand is also time-consuming, and its accuracy is limited. Content filtering is mainly text classification: an algorithm analyzes the input text and, based on the result, classifies it as normal mail or spam. Text classification most often uses keyword filtering, in which a mail is judged to be spam or normal mail according to whether it contains certain keywords. More advanced techniques are also beginning to be applied to spam filtering.
Chinese invention patent application No. 200410009854 discloses "a method and system for spam filtering". It represents the spam and the legitimate mail in an original mail corpus as two generalized suffix tree (GST) structures. For a newly arrived mail, it gathers variable-length statistics at each text position to obtain automatically how often the mail's text occurs in the spam set and in the legitimate set, calculates the mail's similarity to each set, and finally determines whether the new mail is spam or normal mail.
Chinese invention patent application No. 200410018327 discloses "an adaptive, secure method for filtering spam". The method builds two rule bases: a central one formed automatically on a server, and a local one formed automatically on the user's PC. On the user's PC, a post-processing system scores each received mail and judges from the central and local rule bases whether it is spam. Both rule bases learn and update automatically, the central rule base on the server and the local rule base on the user's PC, and the PC periodically fetches the latest central rule base. Mail the user receives is analyzed intelligently by content; legitimate mail stays in the inbox and spam goes to a quarantine area. This raises the spam detection rate while lowering the misjudgment rate for legitimate mail, saving the user time and effort.
Chinese invention patent application No. 200510114440 discloses "a method for filtering spam". A DNA pattern recognition module first performs pattern recognition on input sets of normal mail and spam and stores the results in a DNA pattern library. A feature-pattern word segmentation module then examines each mail in the following order: the encoded message body is decoded and the patterns it contains are identified; DNA-assisted word segmentation is applied to the mail, and the feature patterns in the body and subject are identified against the DNA pattern library and marked; the processed body and subject are reassembled into a mail meeting specific requirements and passed to a Bayes detection system; the Bayes detection system classifies the processed mail and intercepts mail that does not meet the classification conditions.
The applicant holds that spam is mail that is useless or uninteresting to its recipient, and that the same mail can have different attributes for different recipients: some consider it useful normal mail, others consider it spam. Unfortunately, anti-spam techniques to date, including the patent applications mentioned above, do not learn the recipient's work and life interests, and so cannot provide intelligent spam handling based on interest cognition.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art and provide a spam identification method based on interest cognition. By processing the mail a user sends and receives, the method learns the user's work and life interests through a cognitive mechanism, and on that basis filters out spam and keeps normal mail. The method can also be combined with other anti-spam techniques; in that case it outputs a fingerprint representing the mail's attribute, which the overall system uses as a basis for its decision.
Another object of the invention is to provide a spam identification system based on interest cognition that implements the above method.
The object of the invention is achieved by the following technical solution: a spam identification method based on interest cognition, comprising the following steps:
1. Building and maintaining the knowledge base of personal interest cognition
1.1 Collect as much of the mail the user handles as possible, including mail the user sends and received mail that another filtering system has already classified as spam, in order to learn the user's life and work interests. The subject and body of each sent and received mail are first decomposed into words using word segmentation (for Chinese text, Chinese word segmentation), a natural-language-understanding technique from artificial intelligence.
1.2 Use the words obtained in 1.1 as the index to build, update, and expand the knowledge base. A word not yet in the knowledge base is added, and its attribute probability is recorded according to 1.3. For a word already in the knowledge base, the attribute probability is recalculated and refreshed to take the new event into account. This is how the knowledge base accumulates and is updated. When the invention is first used the knowledge base is empty; it is built up by collecting the user's mail under the user's guidance, and knowledge is accumulated and refreshed as the user's mail grows.
1.3 Attribute probabilities in the knowledge base are determined and refreshed by the following rules. Every word in a sent mail is recorded as a sample occurring in normal mail. For words in received mail, during the training period the sample is counted under the attribute determined by the user; after the training period ends, it is counted under the attribute decided by the system.
1.4 A threshold is set on the total number of samples recorded for each word in the knowledge base, as a mark of how mature that word's attribute is. The training period may end only when the total sample count of every word exceeds this threshold.
2. Probability calculation and evaluation of the mail attribute
2.1 For each word in the mail to be evaluated, calculate its conditional probability from all the words obtained in 1.1 and the attribute probabilities obtained in 1.2 and 1.3.
2.2 Use the results of 2.1 and the Bayes formula to calculate the attribute probability of the mail.
2.3 Evaluate the mail's attribute against a given threshold.
2.4 During the training period, compare the system's evaluation with the user's decision and revise the decision threshold. The training period may end only when the two decisions agree closely enough.
3. Output of results
3.1 When the invention is used on its own: during the training period, the evaluated attribute fingerprint (attribute probability) of the mail is displayed; after the training period ends, spam is moved to a quarantine area.
3.2 When the invention is used together with other anti-spam techniques, the mail's attribute fingerprint (attribute probability) is output to a specified interface.
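Steps 2.1 to 2.3 can be illustrated with a minimal sketch. The patent names the Bayes formula but does not spell out how per-word probabilities are combined, so the naive Bayes combination below, the equal class priors, the clamping bounds, the 0.9 threshold, and the function names are all assumptions made for illustration. The sketch also scores the probability of spam, whereas the text could equally be read as scoring the probability of normal mail.

```python
import math

def mail_spam_probability(word_spam_probs):
    """Combine per-word P(spam | word) values into P(spam | mail).

    Assumes equal class priors and conditionally independent words
    (naive Bayes); both are assumptions of this sketch.
    """
    log_spam = 0.0
    log_normal = 0.0
    for p in word_spam_probs:
        p = min(max(p, 0.01), 0.99)   # clamp so one word cannot decide alone
        log_spam += math.log(p)
        log_normal += math.log(1.0 - p)
    # Normalise in log space to avoid underflow on long mails.
    top = max(log_spam, log_normal)
    spam = math.exp(log_spam - top)
    normal = math.exp(log_normal - top)
    return spam / (spam + normal)

def evaluate(word_spam_probs, threshold=0.9):
    """Step 2.3: compare the mail's probability with a decision threshold."""
    if mail_spam_probability(word_spam_probs) >= threshold:
        return "spam"
    return "normal"
```

A mail with no known words scores 0.5 under these assumptions, which is one reason such a mail would be referred to the user rather than decided automatically.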
In step 1.1, mail sent by the user is treated as normal mail and given the highest weight. Because the mail a user sends reflects, to some extent, that user's work and social interests, the technique analyzes the user's personal interests (the segmented words that characterize the user's language) and builds a corresponding fingerprint base. Each further occurrence of the same word revises that word's fingerprint.
In step 1.1, mail received by the user falls into two classes: normal mail the user is interested in, and spam the user is not interested in. The invention uses supervised learning. During the training period the user is asked to classify each received mail; after the training period ends, the system classifies mail automatically by calculation and evaluation. Each word of a classified mail is then treated as an event and used to recalculate that word's attribute probability.
In step 1.1, decomposition into words means separating out the individual words in the keywords, phrases, sentences, and paragraphs of the subject and body of the mail the user sends and receives. The technique of separating the words in Chinese phrases, sentences, and paragraphs is called Chinese word segmentation.
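The patent does not say which segmentation algorithm is used. As one simple possibility, the sketch below uses forward maximum matching against a word list; the dictionary contents, the `max_len` limit, and the function name are hypothetical, and a production system would more likely rely on a dedicated segmentation library.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With a dictionary containing both a compound and its parts, the longest match wins, so the compound is kept as one word.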
In step 1.2, knowledge accumulation and updating has two aspects. (A) Adding words: when a new mail arrives, the system looks up each of its words in the dictionary, and if a word is not found, the word and its probability are added to the knowledge base. (B) Updating word attribute probabilities: when a new mail arrives, the system looks up each of its words in the dictionary, and if a word is found, its previous probability is retrieved, recalculated to take the current event into account, and written back to the knowledge base. Whatever kind of mail is input (sent mail, received normal mail, or spam), every word produced by segmentation is looked up in the knowledge base. Words not yet present are added; for words already present, the probability is recalculated according to the nature of the mail and the knowledge base is maintained accordingly.
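The add-or-update cycle in (A) and (B) can be sketched as follows. The record layout, the `weight` parameter, and the Laplace smoothing in `spam_probability` are assumptions of this sketch; the patent states only that a word's probability is recalculated from its recorded events.

```python
from dataclasses import dataclass

@dataclass
class WordRecord:
    normal: float = 0.0   # weighted count of occurrences in normal mail
    spam: float = 0.0     # weighted count of occurrences in spam

    @property
    def total(self):
        return self.normal + self.spam

    @property
    def spam_probability(self):
        # Laplace smoothing (an assumption) keeps new words away from 0 or 1.
        return (self.spam + 1.0) / (self.total + 2.0)

def update_knowledge_base(kb, words, is_spam, weight=1.0):
    """Record one event per word occurrence in a classified mail."""
    for word in words:
        record = kb.setdefault(word, WordRecord())   # (A) add if missing
        if is_spam:                                  # (B) update counts
            record.spam += weight
        else:
            record.normal += weight
    return kb
```

Storing counts rather than probabilities means the probability is always derived from the full event history, so each new event refreshes it without a separate recalculation step.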
The training period in step 1.3 starts when the user begins using the invention. Its end is marked by two conditions. First, among all the words in a mail, the word with the fewest recorded occurrences (total sample count) must have a count greater than a preset threshold. Second, the degree of agreement between the system's evaluations and the user's manual evaluations during the training period must exceed another preset threshold. When a mail satisfies both conditions, the system classifies it automatically without prompting the user. When the user takes up a new interest or changes jobs, the words in incoming mail no longer satisfy these conditions, and the system returns to the training period automatically.
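The two conditions amount to a per-mail check, sketched below. The threshold values, the way the agreement rate is supplied, and all names are hypothetical; the patent says only that both thresholds are preset.

```python
def can_decide_automatically(words, sample_counts, agreement_rate,
                             min_samples=5, min_agreement=0.95):
    """Return True when a mail may be classified without asking the user.

    sample_counts maps word -> total recorded occurrences.
    agreement_rate is the fraction of past mails on which the system's
    evaluation matched the user's.
    """
    # Condition 1: even the least-seen word has enough samples.
    least_seen = min((sample_counts.get(w, 0) for w in words), default=0)
    # Condition 2: the system has agreed with the user often enough.
    return least_seen > min_samples and agreement_rate > min_agreement
```

Because the check runs per mail, a single unfamiliar word is enough to send that mail back to the user, which matches the described fallback when the user's interests change.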
A spam identification system based on interest cognition that implements the above method comprises a word segmentation component, a spam probability calculation component, a knowledge base component, a classification evaluation component, and an attribute evaluation output component. The word segmentation component is connected to both the spam probability calculation component and the classification evaluation component. The spam probability calculation component and the classification evaluation component are connected to each other, and both are connected to the knowledge base component. The attribute evaluation output component is connected to the classification evaluation component.
The starting point of the invention is that what counts as spam varies from person to person. For example, an advertising mail about human-resources training is legitimate mail for someone working in human-resource management, who can get useful information and knowledge from it, but it is useless spam for technical staff, finance staff, and others. Likewise, mail carrying stock-market knowledge and information is useful, legitimate mail for a stock investor, but spam for people who do not invest and have no interest in the stock market. To judge whether a mail is spam, therefore, the system must learn the user's work and life interests. The subject and body of a mail are expressed in words. The invention uses Chinese word segmentation to split the subject and the keywords of the body into words, and these words reflect the user's work and life interests. If certain words occur frequently in the normal mail the user sends and receives and rarely in mail the user has classified as spam, then a newly received mail containing those words is more likely to be legitimate. The effectiveness of the invention thus depends on how well the user's personal interests have been learned. The invention accumulates its knowledge base by the same mechanism as human cognition: when first used, the knowledge base starts accumulating from nothing, like an infant, and gradually builds up into a dictionary that can evaluate mail effectively and form mail fingerprints. A mail fingerprint describes the nature of a mail (spam or normal mail).
Compared with the prior art, the invention has the following advantages and effects:
(1) The greatest advantage of the invention is that it fits the user's actual situation: it uses the user's own life and work interests to evaluate mail intelligently. The invention does not require a spam corpus (although one can be used to initialize word attributes); the knowledge base is formed by training on the individual's mail, so it is strongly personalized. As a result, it can filter spam effectively and retain legitimate mail effectively.
(2) The knowledge base of personal life and work interests is learned and refreshed continuously. Every mail the user receives or sends is treated as a learning step, and every occurrence of a word in those mails is counted as an event. The knowledge base therefore keeps learning and refreshing, so filtering performance is maintained and continues to improve.
(3) The invention trains the knowledge base by supervised machine learning. When the invention is first used, the user works as if the invention were not installed, except that for each received mail the system asks the user for its attribute and accepts the user's guidance as teacher. Once a received mail meets the requirements for ending the training period, the system automatically filters out mail evaluated as spam and keeps legitimate mail without asking the user. If the user finds that a mail evaluated as legitimate and passed by the filter is actually spam that was missed, the user can mark it when deleting it, and the system accepts this guidance and changes the event's attribute in the knowledge base automatically. This keeps the filter close to the user and safeguards filtering performance.
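The correction described in (3) can be pictured as moving a recorded event from one class to the other. The sketch below assumes a hypothetical layout in which each word maps to a pair of counts; the patent does not describe how the event attribute is changed internally.

```python
def relabel_as_spam(counts, words):
    """Move one recorded event per word from the normal class to spam.

    counts maps word -> [normal_count, spam_count] (a hypothetical layout).
    """
    for word in words:
        record = counts.setdefault(word, [0, 0])
        if record[0] > 0:
            record[0] -= 1   # undo the event wrongly recorded as normal
        record[1] += 1       # record it as spam instead
    return counts
```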
(4) The invention treats the boundary between the training period and the working period honestly and intelligently. The system is not rigidly divided into a training period and a working period; instead it acknowledges what it knows and what it does not know. When the knowledge base has enough knowledge to evaluate and decide on a mail's content, the system decides and handles the mail itself; when it does not, the system learns by asking the user. The main benefit is finer-grained assurance of filtering performance. The system also adapts to a new environment together with the user, so even when the user's life and work interests change, filtering continues to reflect those interests.
(5) The invention is a content-based filtering method, and within that class a statistical one, but it does not exclude other methods and can be combined with them. For example, after mail has been filtered by rule-type methods such as blacklists and whitelists, the invention applies content filtering to the mail that passed, greatly improving filtering performance over the original baseline.
Description of drawings
Fig. 1 is a block flow diagram of the present invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to an embodiment and the accompanying drawing, but embodiments of the invention are not limited to it.
Embodiment
Fig. 1 shows the structure of the system. As Fig. 1 shows, the spam identification system based on personal interest cognition comprises a word segmentation component (4), a spam probability calculation component (5), a knowledge base (6), a classification evaluation component (7), and an attribute evaluation output component (8). The word segmentation component is connected to both the spam probability calculation component and the classification evaluation component. The spam probability calculation component and the classification evaluation component are connected to each other, and both are connected to the knowledge base. The attribute evaluation output component is connected to the classification evaluation component.
The system implements the spam identification method based on interest cognition as follows:
1. Mail collection
The invention samples all mail the user sends and receives through the user's mailbox, and uses it to learn and accumulate knowledge of the user's life and work interests. Mail the user sends is clearly legitimate for that user; if events are weighted, sent mail carries the highest weight. If the user also runs another spam filter, received mail falls into two kinds: mail already classified as spam, and mail still to be classified. Component 1 receives the mail already classified as spam; if the user runs no other spam filter, component 1 in Fig. 1 is absent. Component 2 receives the mail that the other filter did not filter out, that is, the mail it passed as normal; if the user runs no other spam filter, component 2 receives all received mail. Component 3 receives the mail the user sends. Each of the three components forwards the mail it receives to the word segmentation component 4. Mail from all three sources is obtained by a corresponding copying technique.
2. Building and maintaining the knowledge base
The invention uses a cognitive mechanism to build a knowledge base that matches the user's work and life interests. Those interests are reflected in the mail the user sends and receives, and the content of a mail is reflected in how often the words making up the sentences and phrases of its header and body occur in legitimate mail and in spam.
The invention first segments the received and sent mail into words, then counts how often each word occurs in spam and in legitimate mail, and from these frequencies forms the word's attribute probability. During operation the knowledge base keeps learning and updating as mail accumulates.
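The patent does not give the formula that turns these frequencies into an attribute probability. One plausible reading, stated here as an assumption, is the relative frequency of a word's recorded occurrences in each class, where n_spam(w) and n_normal(w) are hypothetical names for the (optionally weighted) sample counts of word w:

```latex
P(\text{spam} \mid w) = \frac{n_{\text{spam}}(w)}{n_{\text{spam}}(w) + n_{\text{normal}}(w)},
\qquad
P(\text{normal} \mid w) = 1 - P(\text{spam} \mid w)
```

Under this reading, a word recorded 3 times in spam and 9 times in normal mail would have P(spam | w) = 3 / 12 = 0.25.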
Component 4 is the word segmentation component. It separates the words in the phrases, sentences, and paragraphs of the subject and body of mail delivered by components 1, 2, and 3, and then passes the words of each mail one by one to component 5 for processing.
Component 5 is the word attribute calculation component. It calculates the attribute probability of the words supplied by component 4 and maintains the knowledge base. Maintenance consists of: (A) Adding words. The system looks up each segmented word in knowledge base 6; if the word is absent, the word and its probability are added to knowledge base 6. (B) Updating word attribute probabilities. If the word is already in knowledge base 6, its previous probability is retrieved, recalculated to take the current event into account, and written back. When probabilities are created or maintained, the attribute of the current event is "legitimate" for words in sent mail and "illegitimate" for words in mail that another filter has classified as spam; for received mail not yet classified, the event takes the attribute decided by the classification evaluation component 7. Component 5 therefore takes input from components 4, 6, and 7 and writes its output to component 6.
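The rule for assigning an event's attribute by mail source can be sketched as a small lookup. The enum names, the numeric weight given to sent mail, and the function name are hypothetical; the patent states only that sent mail carries the highest weight when events are weighted.

```python
from enum import Enum

class Source(Enum):
    FILTERED_SPAM = 1   # component 1: classified as spam by another filter
    RECEIVED = 2        # component 2: received, not yet classified
    SENT = 3            # component 3: sent by the user

SENT_WEIGHT = 3.0       # hypothetical "highest weight" for sent mail

def label_event(source, classifier_says_spam=None):
    """Return (is_spam, weight) for the words of one mail."""
    if source is Source.SENT:
        return (False, SENT_WEIGHT)
    if source is Source.FILTERED_SPAM:
        return (True, 1.0)
    if classifier_says_spam is None:
        raise ValueError("unclassified mail needs a decision from component 7")
    return (classifier_says_spam, 1.0)
```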
While calculating each word's probability, component 5 also marks the total number of times the word has occurred. This mark shows whether the total has reached the preset count for ending the training period, and it is stored in the word attribute knowledge base 6.
Component 6 is the word attribute knowledge base. It stores knowledge based on personal interests, in practice words and their attribute probabilities. Component 6 answers queries from components 5 and 7 and accepts writes from component 5.
3. Classification evaluation and decision
The invention uses the classification evaluation component 7 to calculate the attribute of a mail's content and make a decision on it. The function of component 7 is to make the classification decision for a mail. The procedure is as follows. A received mail (from component 1 or component 2) is segmented into words by component 4 and passed to component 7. Component 7 first retrieves the attribute probability of each word from the knowledge base, then calculates the attribute of the mail by a statistical decision method such as Bayes classification (other classification techniques such as KNN, SVM, Winnow, or Rocchio may also be used). The attribute is expressed as a probability. Component 7 holds a preset evaluation criterion: the mail is judged legitimate when the calculated probability reaches the criterion, and spam otherwise. Component 7 thus receives words from component 4 and retrieves their attribute probabilities from the knowledge base component 6. Its evaluation result goes to both component 5 and component 8. Component 5 uses the result to recalculate the words' attribute probabilities and refresh the knowledge base; component 8 outputs the classification result in a suitable form.
4. Output of the classification evaluation
Classification results are output by the evaluation output component 8, whose function is to output the decision of the classification evaluation component 7 in a suitable form.
For mail that another filter has already filtered out, the evaluation serves as a check, and the other filter's result is taken into account with a certain weight. A notice is given only when the calculated probability that the mail is normal reaches a high level (a preset probability threshold); otherwise the decision produces no substantive output.
Handling the output for received mail not yet classified is the main function of component 8. In terms of system composition, the invention can be used on its own or integrated with other filtering methods. When used on its own, or in series with another filter, the output is the decision itself: legitimate mail or spam. Serial use means that the other filter runs either before mail enters this filter or after this filter has processed it. Integrated use means that several filtering methods each compute a result, and an integrating filter combines those results by some rule to make the final attribute decision. In integrated use, this filter outputs an attribute probability, or attribute fingerprint.
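The combining rule of the integrating filter is left open in the text. Purely as an illustration, the sketch below uses a weighted average of the spam probabilities reported by each filter; the filter names, weights, threshold, and function name are hypothetical.

```python
def integrate(scores, weights, threshold=0.5):
    """Combine per-filter spam probabilities by weighted average.

    scores and weights both map a filter name to a number.
    Returns (combined_probability, is_spam).
    """
    total_weight = sum(weights[name] for name in scores)
    combined = sum(scores[name] * weights[name] for name in scores) / total_weight
    return combined, combined >= threshold
```

For example, an interest-based score of 0.9 with weight 2 and a blacklist score of 0.3 with weight 1 combine to 0.7 under this rule.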
In terms of operating state, the filter has two states: the training period and the working period. Entering the working period requires two conditions. First, every word in the mail must have occurred more than a preset minimum number of times; that is, none of the probabilities retrieved from the knowledge base may belong to a word whose total sample count has not yet reached the preset minimum. Second, the filter's evaluations and the user's evaluations must have reached a preset degree of agreement. Component 8 checks both flags each time it receives the words of a mail from component 4.
During the training period, component 8 asks the user, records the user's decision, and compares it with the decision of component 7 to determine the second flag for entering the working period. Output follows the user's choice: in standalone use, spam is deleted or quarantined and legitimate mail is kept; in integrated use, the fingerprint score with the highest weight is output.
During the working period, component 8 no longer asks the user and directly outputs the decision of component 7: in standalone use, spam is deleted or quarantined and legitimate mail is kept; in integrated use, the fingerprint score calculated by component 7 is output.
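The behavior of component 8 across the two periods and the two modes of use can be summarized in a small dispatch function. This is a simplified sketch: the action names, the threshold, and the choice to return the raw score in integrated use during both periods are assumptions, not details given in the text.

```python
def output_action(in_training, integrated, spam_probability,
                  user_says_spam=None, threshold=0.9):
    """Return (action, score) for one mail.

    action is one of "score", "ask_user", "quarantine", "keep".
    """
    if integrated:
        # Integrated use: hand the fingerprint score to the integrating filter.
        return ("score", spam_probability)
    if in_training:
        if user_says_spam is None:
            return ("ask_user", spam_probability)
        is_spam = user_says_spam          # the user's choice decides
    else:
        is_spam = spam_probability >= threshold
    return ("quarantine" if is_spam else "keep", spam_probability)
```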

Claims (8)

1. A spam identification method based on interest cognition, characterized by comprising the following steps:
1. Building and maintaining the knowledge base of personal interest cognition
1.1 Collect as much of the mail the user handles as possible, in order to learn the user's life and work interests; decompose the subject and body of the user's mail into words;
1.2 Use the words obtained in 1.1 as the index to build, update, and expand the knowledge base; a word not yet present is added and its attribute probability is recorded according to 1.3; for a word already in the knowledge base, the attribute probability is recalculated and refreshed to take the new event into account, thereby accumulating and updating the knowledge base;
1.3 Attribute probabilities in the knowledge base are determined and refreshed by the following rules: every word in a sent mail is recorded as a sample occurring in normal mail; for words in received mail, during the training period the sample is counted under the attribute determined by the user, and after the training period ends it is counted under the attribute decided by the system;
1.4 A threshold is set on the total number of samples recorded for each word in the knowledge base; the training period may end only when the total sample count of every word exceeds this threshold;
2. probability calculation and mail attribute evaluation
2.1 calculate the conditional probability that each word occurs respectively in mail to be evaluated according to the attribute probability that draws in the total words and 1.2 and 1.3 that draws in 1.1;
2.2 utilize 2.1 result to utilize the Bayes formula to calculate the attribute probability of mail;
2.3 carry out attribute evaluation according to given decision-making value;
2.4 in training period, the result of decision of estimating the result of decision and client is compared and revises decision-making value, only when the result of decision and client's result of decision reach near the time, could finish training period;
3. result's output.
2. The spam recognition method based on interest cognition according to claim 1, characterized in that: in step 1.1, the mail the client sends and receives comprises the mail sent by the client and those received mails that another filtering system has classified as spam; the mail sent by the client is treated as normal mail and given the highest weight; the client's personal interests are analyzed and a corresponding fingerprint base is established, and each further occurrence of the same word revises the fingerprint of that word.
3. The spam recognition method based on interest cognition according to claim 1, characterized in that: in step 1.1, the mail the client receives falls into two classes, normal mail that interests the client and spam that does not; received mail is processed by supervised learning (training with a teacher): during the training period the user is asked to classify each received mail, and after the training period ends the system classifies it automatically by calculation and evaluation; the words of each classified mail are then used as events to recalculate the attribute probabilities of those words.
4. The spam recognition method based on interest cognition according to claim 1, characterized in that: in step 1.1, the step of decomposing into words separates out the individual words in the subject phrases and in the content keywords, phrases, sentences, and paragraphs of the mail the client sends and receives.
5. The spam recognition method based on interest cognition according to claim 1, characterized in that: in step 1.2, the accumulation and updating of the knowledge base comprise two aspects: (A) adding words: when a new mail arrives, the system looks up each of its words in the dictionary, and when a word is not found, the word and its probability are added to the knowledge base; (B) updating the attribute probability of words: when a new mail arrives, the system looks up each of its words in the dictionary, and when a word is found, its previous probability is retrieved, recalculated to take this event into account, and the probability of the word in the knowledge base is refreshed with the result.
6. The spam recognition method based on interest cognition according to claim 1, characterized in that: the training period in step 1.3 begins at the moment the user starts using the invention; the end of the training period is marked by two flags: first, among all the words in a mail, the total occurrence count of the least frequent word is greater than a preset threshold; second, the degree of approximation between the system's evaluation results and the manual evaluation results during the training period exceeds another preset threshold; when a mail satisfies both conditions, the system classifies it automatically without prompting the client; when the user takes up new social or leisure interests or changes jobs, the words appearing in the mail can no longer satisfy these conditions, and the system automatically re-enters the training period.
7. The spam recognition method based on interest cognition according to claim 1, characterized in that: step 3, outputting the result, comprises the following steps:
3.1 When the method is used alone, the evaluated mail attribute fingerprint is displayed during the training period; after the training period ends, spam is placed in the quarantine area;
3.2 When the method is used together with other anti-spam technologies, the mail attribute fingerprint is output to a specified interface.
8. A spam recognition system based on interest cognition for implementing the method of any one of claims 1 to 7, characterized by comprising a word segmentation component, a spam probability calculation component, a knowledge base component, a classification evaluation component, and an attribute evaluation output component; the word segmentation component is connected to both the spam probability calculation component and the classification evaluation component; the spam probability calculation component and the classification evaluation component are connected to each other, and both are connected to the knowledge base component; and the attribute evaluation output component is connected to the classification evaluation component.
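For readers who want a concrete picture of steps 1.2, 1.3, and 2.1 to 2.3 of claim 1, the following minimal sketch may help. It is not the patented implementation. The claims only say that per-word attribute probabilities are refreshed on each new event and combined with "the Bayes formula", so everything specific here is an assumption: the names `KnowledgeBase`, `update`, `spam_probability`, `mail_spam_probability`, and `classify`; the add-one (Laplace) smoothing; the equal prior for spam and normal mail; the word-independence (naive Bayes) combination; and the decision threshold of 0.9.

```python
import math
from collections import defaultdict


class KnowledgeBase:
    """Per-word sample counts, indexed by word (steps 1.2 and 1.3)."""

    def __init__(self):
        # word -> [samples seen in normal mail, samples seen in spam]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, words, is_spam):
        # Each classified mail is one event for every distinct word in it.
        # Unknown words are added; known words have their counts refreshed.
        index = 1 if is_spam else 0
        for word in set(words):
            self.counts[word][index] += 1

    def spam_probability(self, word):
        # Assumed add-one smoothing: an unseen word scores a neutral 0.5.
        normal, spam = self.counts.get(word, (0, 0))
        return (spam + 1) / (normal + spam + 2)


def mail_spam_probability(kb, words):
    # Assumed naive Bayes combination with equal priors:
    # P = prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    probs = [kb.spam_probability(w) for w in set(words)]
    log_spam = sum(math.log(p) for p in probs)
    log_normal = sum(math.log(1.0 - p) for p in probs)
    diff = min(log_normal - log_spam, 700.0)  # keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(diff))


def classify(kb, words, threshold=0.9):
    # Step 2.3: compare the mail's probability with a decision threshold.
    return mail_spam_probability(kb, words) >= threshold
```

In use, mail the client sends would be passed to `update(words, is_spam=False)`, following the rule in step 1.3 that every word in a sent mail counts as a normal-mail sample. Received mail would be passed with the client's verdict during the training period and with the result of `classify` afterwards.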

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200610124174A CN100583840C (en) 2006-12-12 2006-12-12 Spam mail identify method based on interest cognition and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200610124174A CN100583840C (en) 2006-12-12 2006-12-12 Spam mail identify method based on interest cognition and system thereof

Publications (2)

Publication Number Publication Date
CN1976323A CN1976323A (en) 2007-06-06
CN100583840C true CN100583840C (en) 2010-01-20

Family

ID=38126124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200610124174A Expired - Fee Related CN100583840C (en) 2006-12-12 2006-12-12 Spam mail identify method based on interest cognition and system thereof

Country Status (1)

Country Link
CN (1) CN100583840C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI569608B (en) * 2015-10-08 2017-02-01 網擎資訊軟體股份有限公司 A computer program product and e-mail transmission method thereof for e-mail transmission in monitored network environment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101119341B (en) * 2007-09-20 2011-02-16 腾讯科技(深圳)有限公司 Mail identifying method and apparatus
CN110753024A (en) * 2018-07-23 2020-02-04 南京航空航天大学 Personalized mail re-filtering method in collective environment

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
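Pu Haichen, Wan Xiaodong. Design of a mail filtering system based on text classification technology. 科技广场, No. 06, 2005 *
Li Xing, Tian Ying, Duan Haixin. Implementation and evaluation of a Chinese spam filtering system. Journal of Dalian University of Technology (大连理工大学学报), Vol. 45, supplement, 2005 *
Li Wentian. Anti-spam strategy based on the Bayesian filtering algorithm. Journal of Kunming University of Science and Technology (昆明理工大学学报), Vol. 30, No. 3, 2005 *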

Also Published As

Publication number Publication date
CN1976323A (en) 2007-06-06

Similar Documents

Publication Publication Date Title
CN105740228B (en) A kind of internet public feelings analysis method and system
US8112484B1 (en) Apparatus and method for auxiliary classification for generating features for a spam filtering model
US7930353B2 (en) Trees of classifiers for detecting email spam
CN107315778A (en) A kind of natural language the analysis of public opinion method based on big data sentiment analysis
CN108874777A (en) A kind of method and device of text anti-spam
Ning et al. Spam message classification based on the Naïve Bayes classification algorithm
CN103136266A (en) Method and device for classification of mail
CN1983942A (en) Method and apparatus for identifying potential recipients
CN101784022A (en) Method and system for filtering and classifying short messages
CN101155182A (en) Garbage information filtering method and apparatus based on network
Sharma et al. Spam mails filtering using different classifiers with feature selection and reduction technique
Ma et al. A novel spam email detection system based on negative selection
CN100583840C (en) Spam mail identify method based on interest cognition and system thereof
CN101329668A (en) Method and apparatus for generating information regulation and method and system for judging information types
Reddy et al. Classification of Spam Messages using Random Forest Algorithm
Sirisanyalak et al. An artificial immunity-based spam detection system
CN114238738A (en) Rumor detection method based on attention mechanism and bidirectional GRU
WO2017094202A1 (en) Document structure analysis device which applies image processing
CN113177164A (en) Multi-platform collaborative new media content monitoring and management system based on big data
Daisy et al. Email Spam Behavioral Sieving Technique using Hybrid Algorithm
Islam et al. Machine learning approaches for modeling spammer behavior
alias Balamurugan et al. Data mining techniques for suspicious email detection: A comparative study
Glymin et al. Rough set approach to spam filter learning
Murugavel et al. K-Nearest neighbor classification of E-Mail messages for spam detection
Songkhla et al. Statistical rules for thai spam detection

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100120

Termination date: 20121212