CN105871887A - Client-side based personalized E-mail filtering system and method - Google Patents

Info

Publication number
CN105871887A
Authority
CN
China
Prior art keywords
mail
feature
spam
filtering
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610316436.0A
Other languages
Chinese (zh)
Other versions
CN105871887B (en
Inventor
谭营
高扬
米古月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201610316436.0A
Publication of CN105871887A
Application granted
Publication of CN105871887B
Expired - Fee Related
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0236 Filtering by address, protocol, port number or service, e.g. IP-address or URL
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0245 Filtering by information in the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0263 Rule management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a client-side based personalized E-mail filtering system and method. The system comprises a receiving module, a filtering and updating module and a display module. The receiving module receives E-mails and preprocesses the E-mails. The filtering and updating module comprises a database, a condition matcher and an intelligent detecting classifier. The database is a training data set. The condition matcher is used for a user to set filtering conditions, the E-mails are filtered according to the filtering conditions and then are detected and classified by utilizing the classifier, meanwhile the received E-mails are utilized to update the training data set of the classifier in real time, and therefore personalized E-mail filtering and classification are achieved. The display module displays E-mail filtering and classification results. By the adoption of the technical scheme, the filtering method is diversified and is good in performance, and the requirements for real-timeliness and personalization can be further met.

Description

Client-based personalized e-mail filtering system and filtering method
Technical field
The present invention relates to e-mail filtering technology, and in particular to a client-based personalized e-mail filtering system and filtering method.
Background art
At present, most spam filtering methods are based on one of two kinds of feature extraction. The first relies on traditional statistics: it analyzes the statistical information of candidate feature words, ranks them by discriminability, and extracts the most discriminative ones. Although this approach can extract a large number of effective features, the features receive no further processing, so the feature vector dimension is too high and the computational complexity increases.
The second is based on artificial immune systems: drawing on immunological ideas, it simulates the generation of biological antibodies and extracts heuristic features. However, such methods emphasize building heuristic rules and seldom use statistical theory to analyze the effectiveness of the extracted features.
Current spam filtering methods mostly train on existing data sets and find it difficult to update the data set in real time from the mail received. Most spam filtering used by existing mail clients is performed on the server side, and the client only displays the mail by category. Server-side filtering must collect the usage of many users before the mail data set can be updated, so real-time performance is poor. Moreover, because the server filters mail uniformly, all users get nearly the same filtering behavior, and the personalized needs of individual users are hard to meet.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a client-based personalized e-mail filtering system and filtering method. The method computes immune concentration features and, because different users receive different mail, trains locally on the client: every email a user receives can update the training data set, which yields personalized e-mail filtering.
The technical solution provided by the present invention is as follows:
A client-based personalized e-mail filtering system comprises a receiving module, a filtering and updating module, and a display module;
The receiving module receives mail, preprocesses the received mail, and passes the preprocessing result to the filtering module;
The filtering and updating module comprises a database, a condition matcher and an intelligent detection classifier. The database is a training data set stored locally. The condition matcher lets the user set filter conditions and filters the received mail according to them. At the same time, the intelligent detection classifier detects and classifies the received mail, producing a class for each email, and the received mail is used to update the classifier's training data set in real time. Each user thus has a distinct training data set, so the classifier differs from user to user, which achieves personalized e-mail filtering and classification;
The display module displays the results of e-mail filtering and classification.
The present invention implements the above client system in Java; classifier training and classification are realized by calling the function library of the Waikato Environment for Knowledge Analysis (Weka). The filter conditions set by the user include keyword filter conditions, sender address filter conditions, and the like.
The present invention also provides a client-based personalized e-mail filtering method, divided into a training stage and a filtering stage. The method is based on immune concentration features and trains locally: for every email the user receives, the local data set is updated in real time, giving each user a personalized training data set. This meets the personalized filtering requirements of different users and addresses both the real-time and the personalization problems. The method comprises the following steps:
1) In the training stage, perform the following steps:
11) For an existing e-mail data set, generate two detector sets according to the information content and tendency of words: a normal-mail detector set and a spam detector set;
12) For the existing e-mail data set, use the detector sets built in step 11) to construct immune concentration feature vectors, obtaining the immune concentration feature vector of each email in the data set;
13) Use the immune concentration feature vectors obtained in step 12) to train a classifier, obtaining a trained classifier model;
2) In the filtering stage, perform the following steps:
21) Preprocess the received mail: parse each received email to obtain its subject, body, recipient address and sender address. Filter conditions (including subject filter conditions, sender and recipient address filter conditions, and the like) are set on the subject, recipient address and sender address and are used for mail classification. The body is segmented, so that each email is divided into multiple feature words;
22) Classify and filter the received mail by performing the following operations:
221) For each received email, use the detector sets built in step 11) to reconstruct the email as its immune concentration feature vector;
222) Use the classifier model of step 13) to classify the email, obtaining a classification result;
223) Filter the received mail according to the classification result and the filter conditions set by the user, obtaining a filtering result;
23) Update and display in real time according to user interaction, covering the following cases:
23a) When a received email is classified as spam, it enters the "Spam" folder;
23b) When a received email is classified as normal mail, it enters the "Inbox";
23c) When the user finds normal mail in the spam folder, or spam in the inbox, the user can manually reclassify the misclassified mail. Each reclassified email is segmented into tokens, the method returns to step 1), the detector sets are updated with these tokens, and the immune concentration feature vectors are rebuilt and the classifier is retrained in turn.
In the above filtering method, further, the information content and the tendency of the words in step 11) are computed by a word screening method and a tendency calculation method, respectively. The word screening method is as follows:
For the existing e-mail data set, compute the information gain I(t) of every feature word by Formula 1, sort all feature words by I(t), and add the feature words ranked in the top m% to the gene library. In the embodiment of the invention, m is preferably 50. Formula 1 is the standard information gain expression, consistent with the term definitions that follow:
$$I(t) = -\sum_{i} P(C_i)\log P(C_i) + P(t)\sum_{i} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$
In the above formula, $P(C_i)$ is the frequency of documents of class $C_i$ in the data set; $P(t)$ is the probability of a document in the data set containing feature word t; $P(\bar{t})$ is the probability of a document in the data set not containing feature word t; $P(C_i \mid t)$ is the probability that a document belongs to class $C_i$ given that feature word t occurs in it; $P(C_i \mid \bar{t})$ is the probability that a document belongs to class $C_i$ given that feature word t does not occur in it.
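The word screening step can be sketched in a few lines of Python. This is a minimal illustration, not the patent's Java/Weka implementation: the function names `information_gain` and `build_gene_library`, the representation of each email as a set of feature words, and the base-2 logarithm are all assumptions made for the example.

```python
import math


def information_gain(docs, labels, term):
    """Information gain of `term` over a labelled corpus.

    docs   : list of sets of feature words, one set per email (assumed layout)
    labels : list of class labels, parallel to docs
    """
    n = len(docs)
    classes = sorted(set(labels))

    def plogp(p):
        # Convention: 0 * log(0) is treated as 0.
        return p * math.log2(p) if p > 0 else 0.0

    with_t = [lab for d, lab in zip(docs, labels) if term in d]
    without_t = [lab for d, lab in zip(docs, labels) if term not in d]

    # Class entropy term: -sum_i P(C_i) log P(C_i)
    gain = -sum(plogp(labels.count(c) / n) for c in classes)
    # P(t) * sum_i P(C_i|t) log P(C_i|t)
    if with_t:
        gain += (len(with_t) / n) * sum(
            plogp(with_t.count(c) / len(with_t)) for c in classes)
    # P(not t) * sum_i P(C_i|not t) log P(C_i|not t)
    if without_t:
        gain += (len(without_t) / n) * sum(
            plogp(without_t.count(c) / len(without_t)) for c in classes)
    return gain


def build_gene_library(docs, labels, m_percent=50):
    """Keep the feature words ranked in the top m% by information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    keep = max(1, len(ranked) * m_percent // 100)
    return ranked[:keep]
```

A word that appears in every spam email and in no normal email separates the two classes perfectly, so on a balanced two-class corpus its gain equals the full class entropy of 1 bit.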
The tendency calculation is as follows: for each feature word in the gene library, compute the frequency with which it occurs in spam and the frequency with which it occurs in normal mail. When the feature word occurs more frequently in normal mail than in spam, it is recorded in the normal-mail detector set. When it occurs more frequently in spam than in normal mail, it is recorded in the spam detector set. When the two frequencies are equal, the feature word is placed in neither detector set. The two detector sets are generated in this way.
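The tendency rule above could look like the following sketch. The name `build_detector_sets` is hypothetical, and "frequency" is taken here to mean the fraction of emails in a class that contain the word; the text does not pin down the exact definition, so that choice is an assumption.

```python
from collections import Counter


def build_detector_sets(gene_library, spam_docs, normal_docs):
    """Split gene-library words into spam and normal-mail detector sets.

    spam_docs / normal_docs: non-empty lists of token lists, one per email.
    Frequency = share of emails in the class containing the word (assumed).
    """
    spam_df = Counter(w for doc in spam_docs for w in set(doc))
    normal_df = Counter(w for doc in normal_docs for w in set(doc))

    spam_detectors, normal_detectors = set(), set()
    for word in gene_library:
        f_spam = spam_df[word] / len(spam_docs)
        f_normal = normal_df[word] / len(normal_docs)
        if f_spam > f_normal:
            spam_detectors.add(word)
        elif f_normal > f_spam:
            normal_detectors.add(word)
        # Equal frequencies: the word joins neither set.
    return spam_detectors, normal_detectors
```

Words with equal frequency in both classes are dropped, which matches the rule that such words carry no tendency either way.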
In the above filtering method, further, the immune concentration feature vector of step 12) is constructed as follows: for each email in the e-mail data set, count how many of its distinct feature words appear in the spam detector set and how many appear in the normal-mail detector set. Let N be the number of distinct feature words in the email, S the number of its feature words that appear in the spam detector set, and L the number of its feature words that appear in the normal-mail detector set. The resulting two-dimensional vector (S/N, L/N) is the immune concentration feature vector, which gives the immune concentration feature vector of each email in the data set.
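As a minimal sketch of this construction, the function below maps a tokenized email to the vector (S/N, L/N). The name `concentration_vector` is illustrative, and returning (0.0, 0.0) for an email with no feature words is an assumption, since the text does not cover that case.

```python
def concentration_vector(tokens, spam_detectors, normal_detectors):
    """Return the immune concentration feature vector (S/N, L/N)."""
    distinct = set(tokens)          # N counts distinct feature words only
    n = len(distinct)
    if n == 0:
        return (0.0, 0.0)           # assumed handling of an empty email
    s = len(distinct & spam_detectors)
    l = len(distinct & normal_detectors)
    return (s / n, l / n)
```

For example, an email with the four distinct words "free", "win", "hello" and "lunch", checked against a spam detector set {"free", "win"} and a normal-mail detector set {"hello"}, gives S = 2, L = 1, N = 4.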
In the above filtering method, further, the classifier is a support vector machine (SVM).
In the above filtering method, further, during the training of step 13), a quadratic programming method is used to optimize the parameters of the classifier.
Compared with the prior art, the beneficial effects of the present invention are:
Existing spam filtering methods mostly train on existing data sets and seldom update the data set in real time from the mail received, because they filter on the server side, where the data set can only be updated after the usage of many users has been collected.
The spam filtering method provided by the present invention constructs an immune concentration feature vector for each email, which extracts mail features effectively, improves classification performance and strengthens spam filtering. Building on the good filtering performance of the immune concentration feature method, and given that each user receives different mail, a personalized mail detection classifier is built locally for each user, yielding a personalized spam filtering client. The client system also includes other rule-based filtering methods, such as whitelists and keywords, which diversifies the filtering and improves overall system performance. The client trains locally: every email the user receives updates the training data set. Because different users receive different mail, this real-time update of the local data set gives each user a distinct, personalized training data set and therefore personalized spam filtering across users, solving the real-time and personalization problems.
In summary, the technical solution of the present invention on the one hand offers diversified filtering and good performance (spam filtering indicators such as accuracy, recall and F-measure can exceed 98%), and on the other hand meets the requirements of real-time operation and personalization.
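For reference, the indicators named above can be computed from a confusion matrix as in the sketch below. The function name `filter_metrics` is hypothetical, spam is treated as the positive class, and the sketch assumes the denominators are non-zero; it only illustrates the standard definitions and does not reproduce the evaluation behind the reported figures.

```python
def filter_metrics(tp, fp, fn, tn):
    """Standard metrics with spam as the positive class.

    tp: spam caught, fp: normal mail flagged as spam,
    fn: spam missed,  tn: normal mail passed through.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```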
Brief description of the drawings
Fig. 1 is a flow chart of the filtering method based on immune concentration features provided by the present invention.
Fig. 2 is a structural block diagram of the immune-concentration-based spam client system implemented in the embodiment of the present invention.
Fig. 3 is a screenshot of the main interface of the client system after login in the embodiment of the present invention.
Fig. 4 is a screenshot of the mail reading interface of the client system in the embodiment of the present invention.
Fig. 5 is a screenshot of the filtering function settings interface of the client system in the embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below through embodiments and with reference to the drawings, without limiting the scope of the invention in any way.
The present invention provides a spam filtering method based on immune concentration features, proposes a new immune concentration feature extraction method, and applies the method to an e-mail client system. The system supports several accounts logged in at the same time, reads user mail from the mail server, extracts the concentration features of each email, and uses the classifier to produce the corresponding classification result. The spam filtering method based on immune concentration features is divided into a training stage and a filtering stage. The training stage feeds the training data set into the classifier, learns and optimizes the classifier's parameters, and finally obtains the best-performing classifier. The filtering stage applies the trained classifier to the mail received in the client. The specific steps include:
S1) Take an existing collection of emails as the data set, extract immune concentration feature vectors from it, feed them into the classifier for training and learning, and generate a classifier model; the embodiment of the present invention uses an SVM as the classifier;
S2) After each user receives mail, parse each user's mail separately to obtain the subject, body, and recipient and sender addresses of the mail;
S3) Segment the body of the mail, generate the immune concentration feature vector from the segmented body and the detector sets, and classify the mail with the classifier model generated in S1.
The implementation flow of the filtering method based on immune concentration features is shown in Fig. 1. Each received email is parsed to obtain its subject, sender address and body. The parsed subject and sender address are filtered by matching them against the filter conditions set by the user, including keyword filtering, sender address filtering, and the like. The parsed body is segmented, the immune concentration features are constructed from it, and the classification result is computed. Finally, the filter conditions set by the user and the classifier's result are combined to filter the mail in the client system uniformly. Following this method, the embodiment below builds an immune-concentration-based spam client system. The system is implemented in Java and calls the Weka function library for classifier training and classification. Fig. 2 is a structural block diagram of the system, which mainly comprises three modules: a receiving module, a filtering module and a display module. The receiving module preprocesses the received mail and passes the result to the filtering module. The filtering module filters the mail the user receives using the filter conditions and the intelligent detection classification method, and updates the classifier in real time to achieve personalized classification. The display module displays the filtering result, and spam enters the spam folder. The system is implemented in the following steps:
Step 1: build the detector sets;
A detector set is a collection of detectors. The present invention uses two kinds: a spam detector set and a normal-mail detector set. The tendency of each feature word toward the two classes of mail is computed; feature words that tend to appear in spam are placed in the spam detector set, and feature words that tend to appear in normal mail are placed in the normal-mail detector set.
In the detector set generation stage, the main work is to combine the word screening algorithm with the tendency function, generating the two detector sets from the information content of the feature words (this embodiment uses information gain as the measure of information content; the calculation is given below) and their tendency. Specifically:
11) Word screening: for the existing e-mail data set, segment each email body to obtain its feature words. In this embodiment, segmentation treats each Chinese character as one feature word and each word as one feature word; for example, the two-character word for "city" (城市) is split into the two feature words "城" and "市". After segmentation, each email has been divided into N feature words. The information gain I(t) of every feature word is computed by Formula 1, and all feature words are sorted by I(t). The feature words ranked in the top m% by information gain are added to the gene library; experiments show the best results at m = 50. Formula 1 is the standard information gain expression, consistent with the term definitions that follow:
$$I(t) = -\sum_{i} P(C_i)\log P(C_i) + P(t)\sum_{i} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}) \qquad (1)$$
In the above formula, $C_i$ is a mail class (normal mail or spam); $P(C_i)$ is the frequency of documents of class $C_i$ in the data set; $P(t)$ is the probability of a document in the data set containing feature word t; $P(\bar{t})$ is the probability of a document in the data set not containing feature word t; $P(C_i \mid t)$ is the probability that a document belongs to class $C_i$ given that feature word t occurs in it; $P(C_i \mid \bar{t})$ is the probability that a document belongs to class $C_i$ given that feature word t does not occur in it.
12) Tendency calculation: for each feature word in the gene library, compute the frequency with which it occurs in each class of mail (spam and normal mail in this embodiment).
Feature words that occur more frequently in spam are recorded in the spam detector set DS_S; feature words that occur more frequently in normal mail are recorded in the normal-mail detector set DS_L. (The rationale is that a feature word occurring more frequently in spam should belong to the spam detector set, and a feature word occurring more frequently in normal mail should belong to the normal-mail detector set.)
Step 2: build the immune concentration feature vectors;
For the existing e-mail data set, count how many distinct feature words of each email appear in the spam detector set DS_S and in the normal-mail detector set DS_L. Let N be the number of distinct feature words in an email, S the number of its feature words that appear in the spam detector set, and L the number of its feature words that appear in the normal-mail detector set. The immune concentration feature vector is then the two-dimensional vector (S/N, L/N).
Step 3: train the classifier
The previous step reconstructed each email as its immune concentration feature vector; these feature vectors are now used to train the classifier. The classifier in this embodiment is a support vector machine (SVM). During training, a quadratic programming method is used to optimize the parameters of the classifier model.
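To make the training step concrete without external libraries, the sketch below trains a linear SVM on the two-dimensional concentration vectors. It is a stand-in under stated assumptions: the embodiment uses Weka's SVM with quadratic programming, whereas this example minimizes the same hinge loss with plain subgradient descent. The names `train_linear_svm` and `predict`, the learning rate, the regularization weight and the label convention (+1 spam, -1 normal) are all illustrative choices.

```python
def train_linear_svm(vectors, labels, lr=0.1, lam=0.01, epochs=300):
    """Linear SVM on (S/N, L/N) vectors via hinge-loss subgradient descent.

    labels use +1 for spam and -1 for normal mail (assumed convention).
    This replaces the quadratic-programming solver of the embodiment.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (s, l), y in zip(vectors, labels):
            margin = y * (w[0] * s + w[1] * l + b)
            # The L2 penalty shrinks the weights slightly on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Sample is inside the margin: move the boundary away from it.
                w[0] += lr * y * s
                w[1] += lr * y * l
                b += lr * y
    return w, b


def predict(model, vector):
    """Return +1 (spam) or -1 (normal) for one concentration vector."""
    w, b = model
    score = w[0] * vector[0] + w[1] * vector[1] + b
    return 1 if score >= 0 else -1
```

Because the feature vector has only two dimensions, a linear boundary in the (S/N, L/N) plane is cheap to retrain whenever the detector sets change.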
Step 4: the client system preprocesses the received mail
Fig. 4 is a screenshot of the mail reading interface of the client system in the embodiment of the present invention. As shown in Fig. 4, after the client system receives mail, it parses the mail to obtain the subject, body, and recipient and sender addresses. The subject and the addresses can be filtered according to the filter conditions set by the user; after the body is segmented, it is fed to the classifier model trained in the previous step;
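The preprocessing step might be sketched with Python's standard `email` package as below. The function names `parse_email` and `segment` are hypothetical, and the segmentation rule (each Chinese character is one token, each run of Latin letters or digits is one token) is one reading of the scheme described in Step 1, not a confirmed detail of the implementation.

```python
import re
from email import message_from_bytes
from email.policy import default


def parse_email(raw_bytes):
    """Extract subject, addresses and plain-text body from a raw message."""
    msg = message_from_bytes(raw_bytes, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "subject": str(msg["Subject"] or ""),
        "sender": str(msg["From"] or ""),
        "recipient": str(msg["To"] or ""),
        "body": body,
    }


# One token per Chinese character, one token per alphanumeric run (assumed).
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def segment(text):
    """Split body text into feature words."""
    return _TOKEN.findall(text)
```

The subject and address fields feed the user's filter conditions, while `segment(mail["body"])` produces the feature words used to build the concentration vector.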
Step 5: the client system classifies and filters the mail
In the previous step, each email in the client system was divided into multiple feature words. The filtering function of the client system is switched on as shown in Fig. 5. The detector sets built in Step 1 are then reused: following the method of Step 2, each email is reconstructed as its immune concentration feature vector and classified with the classifier model trained in Step 3. Finally, the mail in the client system is filtered according to the classification result and the filter conditions the user has set on the subject, the sender and recipient addresses, and so on (for example, whether the sender address is on a blacklist, or whether the subject contains certain keywords). The result is displayed as shown in Fig. 3.
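One way the rule-based conditions and the classifier result could be combined is sketched below. The text does not specify a precedence order, so the policy here (whitelist first, then blacklist and subject keywords, then the classifier) is an assumption, as is the function name `apply_filters`.

```python
def apply_filters(subject, sender, classifier_says_spam,
                  whitelist=(), blacklist=(), keywords=()):
    """Return "spam" or "normal" for one email.

    Assumed precedence: whitelist > blacklist > subject keywords > classifier.
    """
    sender = sender.strip().lower()
    if sender in {a.lower() for a in whitelist}:
        return "normal"
    if sender in {a.lower() for a in blacklist}:
        return "spam"
    subject_lower = subject.lower()
    if any(k.lower() in subject_lower for k in keywords):
        return "spam"
    # No user rule matched: fall back to the trained classifier.
    return "spam" if classifier_says_spam else "normal"
```

Putting the user's explicit rules ahead of the classifier keeps the outcome predictable for addresses and keywords the user has configured by hand.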
Step 6: update in real time according to user interaction
The filtering result of the previous step is displayed: emails classified as spam enter the "Spam" folder and normal mail enters the "Inbox". When the user finds normal mail in the spam folder, or spam in the inbox, the user can manually reclassify the misclassified mail. These emails are then segmented, and the process jumps back to Step 1: the detector sets are updated with the resulting tokens, and the immune concentration feature vectors are rebuilt and the classifier is retrained in turn. The detector sets are updated as follows. For an email the user manually marks as spam, all of its feature words that do not belong to the normal-mail detector set are added to the spam detector set. Likewise, for an email manually marked as normal mail, all of its feature words that do not belong to the spam detector set are added to the normal-mail detector set.
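The update rule just described can be written as a small function. The name `update_detectors` and the label strings "spam" and "normal" are illustrative; the sets are modified in place and also returned for convenience.

```python
def update_detectors(tokens, user_label, spam_detectors, normal_detectors):
    """Update both detector sets from one email the user relabelled."""
    words = set(tokens)
    if user_label == "spam":
        # Add every word not already claimed by the normal-mail set.
        spam_detectors |= words - normal_detectors
    elif user_label == "normal":
        # Add every word not already claimed by the spam set.
        normal_detectors |= words - spam_detectors
    else:
        raise ValueError("user_label must be 'spam' or 'normal'")
    return spam_detectors, normal_detectors
```

After this update, the concentration vectors would be rebuilt and the classifier retrained, as the step describes.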
It should be noted that the embodiments are disclosed to aid further understanding of the present invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what the embodiments disclose; the scope of protection is defined by the claims.

Claims (10)

1. A client-based personalized e-mail filtering system, characterized in that it comprises a receiving module, a filtering and updating module, and a display module;
The receiving module receives mail, preprocesses the received mail, and passes the preprocessing result to the filtering module;
The filtering and updating module comprises a database, a condition matcher and an intelligent detection classifier; the database is a training data set stored locally; the condition matcher lets the user set filter conditions, the received mail is filtered according to the filter conditions, and the intelligent detection classifier then detects and classifies the received mail, obtaining the class of each received email; at the same time, the received mail is used to update the local training data set of the intelligent detection classifier in real time, thereby achieving personalized e-mail filtering and classification;
The display module displays the results of e-mail filtering and classification.
2. The personalized e-mail filtering system according to claim 1, characterized in that the system is implemented in Java, and the intelligent detection classifier is realized by calling the Weka function library.
3. The personalized e-mail filtering system according to claim 1, characterized in that the filter conditions set by the user include keyword filter conditions and sender address filter conditions.
4. A client-based personalized e-mail filtering method, comprising a training stage and a filtering stage; the training stage feeds the training data set into a classifier and learns and optimizes the classifier's parameters to obtain an optimal classifier; the filtering stage applies the trained optimal classifier to the mail received in the client; the mail filtering method is based on immune concentration features and, through real-time updates of the client's local data set, obtains a personalized training data set for each user, meeting the personalized spam filtering requirements of different users; the method comprises the following steps:
1) In the training stage, perform the following steps:
11) For an existing e-mail data set, generate detector sets according to the information content and tendency of the tokens; the detector sets comprise a normal-mail detector set and a spam detector set;
12) For the existing e-mail data set, use the detector sets built in step 11) to construct immune concentration feature vectors, obtaining the immune concentration feature vector of each email in the data set;
13) Use the immune concentration feature vectors obtained in step 12) to train a classifier, obtaining a trained classifier model;
2) In the filtering stage, perform the following steps:
21) Preprocess the received mail: parse each received email to obtain its subject, body, recipient address and sender address; filter conditions are set on the subject, recipient address and sender address and are used for mail classification; the body is segmented, so that each email is divided into multiple feature words;
22) Classify and filter the received mail by performing the following operations:
221) For each received email, use the detector sets built in step 11) to reconstruct the email as its immune concentration feature vector;
222) Use the classifier model of step 13) to classify the email as spam or normal mail, thereby obtaining a classification result;
223) Filter the received mail further according to the classification result and the filter conditions set by the user, obtaining a filtering result; the result is that the received email is classified as spam or as normal mail;
23) Update and display in real time according to user interaction.
5. The mail filtering method as claimed in claim 4, characterized in that the real-time updating and display according to user interaction in step 23) covers the following cases:
23a) when a received mail is classified as spam, the mail enters the "spam box";
23b) when a received mail is classified as normal mail, the mail enters the "inbox";
23c) when the user finds a normal mail in the "spam box" or a spam mail in the "inbox", the user can manually reclassify the misclassified mail; the reclassified mail is segmented to obtain its words, and the method returns to the training stage of step 1), where those words are used to update the detector sets, after which the immune concentration feature vectors are rebuilt and the classifier is retrained in turn.
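The feedback loop of step 23c) can be sketched as a small local store whose labels the user can correct. This is a simplified illustration, not the claimed procedure: it rebuilds the detector sets from per-class document frequencies only, and omits the information-gain screening and the classifier retraining that the claim also requires. The names `LocalTrainingSet`, `relabel`, and `rebuild_detectors` are hypothetical.

```python
from collections import Counter


class LocalTrainingSet:
    """Client-side dataset that is updated when the user reclassifies a mail."""

    def __init__(self):
        self.words = {}    # mail_id -> set of segmented words
        self.labels = {}   # mail_id -> "spam" or "ham"

    def add(self, mail_id, words, label):
        self.words[mail_id] = set(words)
        self.labels[mail_id] = label

    def relabel(self, mail_id, new_label):
        """User moved a misclassified mail to the other folder."""
        self.labels[mail_id] = new_label

    def rebuild_detectors(self):
        """Recompute spam and normal-mail detector sets from current labels."""
        spam_count, ham_count = Counter(), Counter()
        for mail_id, words in self.words.items():
            target = spam_count if self.labels[mail_id] == "spam" else ham_count
            target.update(words)  # each mail counts a word at most once
        n_spam = sum(1 for y in self.labels.values() if y == "spam") or 1
        n_ham = sum(1 for y in self.labels.values() if y == "ham") or 1
        spam_det = {w for w in spam_count
                    if spam_count[w] / n_spam > ham_count[w] / n_ham}
        ham_det = {w for w in ham_count
                   if ham_count[w] / n_ham > spam_count[w] / n_spam}
        return spam_det, ham_det


store = LocalTrainingSet()
store.add("m1", ["free", "prize"], "spam")
store.add("m2", ["meeting", "agenda"], "ham")
store.add("m3", ["newsletter", "offer"], "spam")   # landed in the spam box
before = store.rebuild_detectors()
store.relabel("m3", "ham")                         # user moves it to the inbox
after = store.rebuild_detectors()
```

After the correction, the words of mail `m3` move from the spam detector set to the normal-mail detector set, which is the personalization effect the claim describes.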
6. The mail filtering method as claimed in claim 4, characterized in that, in step 11), the detector sets are obtained according to the information content and tendency of the segmented words, which are computed by a word screening method and a tendency calculation method, respectively;
The word screening method is specifically:
For the existing e-mail dataset, the information gain IG(t) of every feature word is computed by Formula 1, all feature words are sorted by the magnitude of IG(t), and the feature words ranked in the top m% are added to the gene library;
$$IG(t) = \sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$
(Formula 1)
In the formula above, $P(c_i)$ is the frequency of documents of class $c_i$ in the dataset; $P(t)$ is the probability of documents in the dataset that contain feature word $t$; $P(\bar{t})$ is the probability of documents in the dataset that do not contain feature word $t$; $P(c_i \mid t)$ is the probability that a document belongs to class $c_i$ given that feature word $t$ occurs in it; $P(c_i \mid \bar{t})$ is the probability that a document belongs to class $c_i$ given that feature word $t$ does not occur in it;
The tendency calculation is specifically: for each feature word in the gene library, compute the frequency with which the feature word occurs in spam and the frequency with which it occurs in normal mail; when the frequency in normal mail is greater than the frequency in spam, the feature word is added to the normal-mail detector set; when the frequency in spam is greater than the frequency in normal mail, the feature word is added to the spam detector set; the two detector sets are thereby generated.
7. The mail filtering method as claimed in claim 6, characterized in that m is 50.
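The word screening and tendency calculation of claim 6 can be sketched as follows. Several points are assumptions, not statements of the patent: "frequency" is taken as the fraction of mails in a class that contain the word; words with equal frequency in both classes are dropped; and the information gain uses the textbook form with a leading minus sign on the first sum. That first term is the same for every word, so the sign does not change the ranking. The function names and the toy dataset are hypothetical.

```python
import math
from collections import Counter


def _sum_p_log_p(counts, total):
    """Sum of p*log(p) over the classes that actually occur."""
    return sum((c / total) * math.log(c / total) for c in counts.values() if c)


def information_gain(word, mails):
    """mails is a non-empty list of (set_of_words, label) pairs."""
    n = len(mails)
    class_counts = Counter(label for _, label in mails)
    with_t = Counter(label for words, label in mails if word in words)
    without_t = class_counts - with_t      # keeps only positive counts
    n_t = sum(with_t.values())
    n_not_t = n - n_t
    ig = -_sum_p_log_p(class_counts, n)    # textbook sign (assumption)
    if n_t:
        ig += (n_t / n) * _sum_p_log_p(with_t, n_t)
    if n_not_t:
        ig += (n_not_t / n) * _sum_p_log_p(without_t, n_not_t)
    return ig


def build_detectors(mails, m_percent=50):
    """Keep the top m% of words by IG, then split them by tendency."""
    vocab = sorted({w for words, _ in mails for w in words})
    ranked = sorted(vocab, key=lambda w: information_gain(w, mails), reverse=True)
    gene_library = ranked[:math.ceil(len(ranked) * m_percent / 100)]
    n_spam = sum(1 for _, y in mails if y == "spam") or 1
    n_ham = sum(1 for _, y in mails if y == "ham") or 1
    spam_detectors, ham_detectors = set(), set()
    for w in gene_library:
        f_spam = sum(1 for ws, y in mails if y == "spam" and w in ws) / n_spam
        f_ham = sum(1 for ws, y in mails if y == "ham" and w in ws) / n_ham
        if f_spam > f_ham:
            spam_detectors.add(w)
        elif f_ham > f_spam:
            ham_detectors.add(w)
    return spam_detectors, ham_detectors


# Toy dataset: "now" occurs everywhere, so it carries no information.
mails = [
    ({"free", "prize", "now"}, "spam"),
    ({"free", "offer", "now"}, "spam"),
    ({"meeting", "report", "now"}, "ham"),
    ({"meeting", "agenda", "now"}, "ham"),
]
spam_detectors, ham_detectors = build_detectors(mails, m_percent=50)
```

On this toy data, "free" ends up in the spam detector set and "meeting" in the normal-mail detector set, while "now" has zero information gain and is screened out.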
8. The mail filtering method as claimed in claim 4, characterized in that the specific method of constructing the immune concentration feature vector in step 12) is:
For each mail in the e-mail dataset, count the number of its distinct feature words that appear in the spam detector set and the number that appear in the normal-mail detector set;
Let N denote the number of distinct feature words in a mail, S the number of feature words in that mail that appear in the spam detector set, and L the number of feature words in that mail that appear in the normal-mail detector set; a two-dimensional vector, denoted (S/N, L/N), is constructed as the immune concentration feature vector, thereby obtaining the immune concentration feature vector corresponding to each mail in the e-mail dataset.
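The two-dimensional vector (S/N, L/N) of claim 8 can be computed with a few set operations. This is a minimal sketch: the function name `concentration_vector` is hypothetical, and returning (0.0, 0.0) for a mail with no feature words is an assumption, since the claim does not cover that case.

```python
def concentration_vector(words, spam_detectors, ham_detectors):
    """Return (S/N, L/N) for one mail.

    N: number of distinct feature words in the mail
    S: how many of them appear in the spam detector set
    L: how many of them appear in the normal-mail detector set
    """
    distinct = set(words)
    n = len(distinct)
    if n == 0:
        return (0.0, 0.0)  # assumption: empty mail maps to the origin
    s = len(distinct & spam_detectors)
    l = len(distinct & ham_detectors)
    return (s / n, l / n)


# "free" is repeated but counted once, so N = 3, S = 2, L = 0.
vec = concentration_vector(
    ["free", "prize", "now", "free"],
    spam_detectors={"free", "prize"},
    ham_detectors={"meeting"},
)
```

Here `vec` is (2/3, 0.0): two of the three distinct words match spam detectors and none match normal-mail detectors. Every mail is reduced to the same fixed-length vector regardless of its size, which is what lets a single classifier consume it.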
9. The mail filtering method as claimed in claim 4, characterized in that the classifier is a support vector machine (SVM).
10. The mail filtering method as claimed in claim 4, characterized in that, during the training of step 13), quadratic programming is used to optimize the parameters of the classifier.
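Claims 9 and 10 name an SVM trained by quadratic programming but do not state the formulation. For orientation only, the standard soft-margin SVM dual problem, a textbook form assumed here rather than taken from the patent, is the quadratic program below. In this reading, $x_i$ would be the immune concentration feature vector (S/N, L/N) of mail $i$, $y_i \in \{+1, -1\}$ its label (spam or normal mail), $C$ a penalty parameter, and $K$ a kernel function; $C$ and $K$ are assumptions.

```latex
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i, x_j)
\quad\text{subject to}\quad
0 \le \alpha_i \le C,\qquad \sum_{i=1}^{n}\alpha_i\,y_i = 0
```

The objective is quadratic in the multipliers $\alpha_i$ and the constraints are linear, which is why training reduces to a quadratic program.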
CN201610316436.0A 2016-05-12 2016-05-12 Client-based individual electronic mail filtering system and filter method Expired - Fee Related CN105871887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610316436.0A CN105871887B (en) 2016-05-12 2016-05-12 Client-based individual electronic mail filtering system and filter method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610316436.0A CN105871887B (en) 2016-05-12 2016-05-12 Client-based individual electronic mail filtering system and filter method

Publications (2)

Publication Number Publication Date
CN105871887A true CN105871887A (en) 2016-08-17
CN105871887B CN105871887B (en) 2019-01-29

Family

ID=56631912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610316436.0A Expired - Fee Related CN105871887B (en) 2016-05-12 2016-05-12 Client-based individual electronic mail filtering system and filter method

Country Status (1)

Country Link
CN (1) CN105871887B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059590A1 (en) * 2006-09-05 2008-03-06 Ecole Polytechnique Federale De Lausanne (Epfl) Method to filter electronic messages in a message processing system
CN101594312A (en) * 2008-05-30 2009-12-02 电子科技大学 A kind of spam recognition methods and device based on artificial immunity and behavioural characteristic
CN101330476A (en) * 2008-07-02 2008-12-24 北京大学 Method for dynamically detecting junk mail
CN101316246A (en) * 2008-07-18 2008-12-03 北京大学 Junk mail detection method and system based on dynamic update of categorizer
CN104156228A (en) * 2014-04-01 2014-11-19 兰州工业学院 Client-side short message filtration embedded feature library generating and updating method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭营等: "反垃圾电子邮件方法研究进展", 《智能系统学报》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372237A (en) * 2016-09-13 2017-02-01 新浪(上海)企业管理有限公司 Fraudulent mail identification method and device
CN106453423A (en) * 2016-12-08 2017-02-22 黑龙江大学 Spam filtering system and method based on user personalized setting
CN106453423B (en) * 2016-12-08 2019-10-01 黑龙江大学 A kind of filtration system and method for the spam based on user individual setting
CN110268429A (en) * 2017-02-10 2019-09-20 微软技术许可有限责任公司 The automatic binding of Email content
CN109918154A (en) * 2017-12-07 2019-06-21 航天信息股份有限公司 A kind of method and system pushing warning information in real time based on Attribute Association
CN108038189A (en) * 2017-12-11 2018-05-15 南京茂毓通软件科技有限公司 A kind of information extracting system of Email
CN109039863A (en) * 2018-08-01 2018-12-18 北京明朝万达科技股份有限公司 A kind of mail security detection method, device and storage medium based on self study
CN109039863B (en) * 2018-08-01 2021-06-22 北京明朝万达科技股份有限公司 Self-learning-based mail security detection method and device and storage medium
CN109831373A (en) * 2019-03-01 2019-05-31 论客科技(广州)有限公司 The anti-erroneous judgement method and device of mailing system high-precision intelligent based on FastText algorithm
CN113343229A (en) * 2021-06-30 2021-09-03 重庆广播电视大学重庆工商职业学院 Network security protection system and method based on artificial intelligence
CN113837154A (en) * 2021-11-25 2021-12-24 之江实验室 Open set filtering system and method based on multitask assistance
CN113837154B (en) * 2021-11-25 2022-03-25 之江实验室 Open set filtering system and method based on multitask assistance

Also Published As

Publication number Publication date
CN105871887B (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN105871887A (en) Client-side based personalized E-mail filtering system and method
CN101674264B (en) Spam detection device and method based on user relationship mining and credit evaluation
Toolan et al. Feature selection for spam and phishing detection
CN106453033B (en) Multi-level process for sorting mailings based on Mail Contents
CN105447505B (en) A kind of multi-level important email detection method
CN101699432B (en) Ordering strategy-based information filtering system
CN103812872B (en) A kind of network navy behavioral value method and system based on mixing Di Li Cray process
Katirai et al. Filtering junk e-mail
CN101540017B (en) Feature extracting method based on byte level n-gram and twit filter
CN106296195A (en) A kind of Risk Identification Method and device
CN103136266A (en) Method and device for classification of mail
CN102842078A (en) Email forensic analyzing method based on community characteristics analysis
CN103886108A (en) Feature selection and weight calculation method of imbalance text set
CN102404249A (en) Method and device for filtering junk emails based on coordinated training
CN104933475A (en) Network forwarding behavior prediction method and apparatus
CN106156105A (en) Email polymerization sorting technique and device
CN109299251A (en) A kind of abnormal refuse messages recognition methods and system based on deep learning algorithm
Bhat et al. Classification of email using BeaKS: Behavior and keyword stemming
CN101594314B (en) Method for identifying image of junk e-mail based on high-order autocorrelation characteristic
CN105117466A (en) Internet information screening system and method
Yeruva et al. E-mail Spam Detection Using Machine Learning–KNN
Reddy et al. Classification of Spam Messages using Random Forest Algorithm
Ergin et al. Turkish anti-spam filtering using binary and probabilistic models
CN105337842B (en) A kind of rubbish mail filtering method unrelated with content
CN101329668A (en) Method and apparatus for generating information regulation and method and system for judging information types

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190129