CN105871887A - Client-side based personalized E-mail filtering system and method - Google Patents
- Publication number
- CN105871887A (application number CN201610316436.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- spam
- filtering
- classifier
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0236—Filtering by address, protocol, port number or service, e.g. IP-address or URL
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/42—Mailbox-related aspects, e.g. synchronisation of mailboxes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0245—Filtering by information in the payload
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0263—Rule management
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a client-side based personalized E-mail filtering system and method. The system comprises a receiving module, a filtering and updating module and a display module. The receiving module receives E-mails and preprocesses the E-mails. The filtering and updating module comprises a database, a condition matcher and an intelligent detecting classifier. The database is a training data set. The condition matcher is used for a user to set filtering conditions, the E-mails are filtered according to the filtering conditions and then are detected and classified by utilizing the classifier, meanwhile the received E-mails are utilized to update the training data set of the classifier in real time, and therefore personalized E-mail filtering and classification are achieved. The display module displays E-mail filtering and classification results. By the adoption of the technical scheme, the filtering method is diversified and is good in performance, and the requirements for real-timeliness and personalization can be further met.
Description
Technical field
The present invention relates to email filtering technology, and in particular to a client-based personalized email filtering system and filtering method.
Background art
Most current spam filtering methods rely on one of two kinds of feature extraction. The first relies on traditional statistics: it analyzes the statistical information of candidate feature words, ranks them by discriminative power, and extracts the most discriminative ones. Although this approach can extract a large number of effective features, it does not process them further, so the feature vectors have very high dimensionality and computation becomes more complex.
The second is based on artificial immune systems: drawing on ideas from immunology, it simulates the generation of biological antibodies to extract heuristic features. However, such methods focus on establishing heuristic rules and rarely use statistical theory to analyze the validity of the extracted features.
Current spam filtering methods are mostly trained on existing datasets, and it is difficult for them to update the dataset in real time from the emails actually received. Most existing mail clients filter spam on the server side and only display the classified mail on the client. Server-side filtering can update the mail dataset only after collecting usage data from many users, so its real-time performance is poor. Moreover, because filtering is performed uniformly on the server, every user gets similar or even identical filtering results, and users' individual needs are hard to satisfy.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention provides a client-based personalized email filtering system and filtering method that compute immune concentration features. Because different users receive different emails, training is performed on the local client, and every email a user receives can be used to update the training dataset, thereby achieving personalized email filtering.
The technical solution provided by the present invention is as follows:
A client-based personalized email filtering system, comprising a receiving module, a filtering and updating module, and a display module.
The receiving module receives emails, preprocesses them, and passes the preprocessing results to the filtering module.
The filtering and updating module comprises a database, a condition matcher, and an intelligent detection classifier. The database is a training dataset stored locally. The condition matcher lets the user set filter conditions, and received emails are filtered according to those conditions. At the same time, the intelligent detection classifier detects and classifies the received emails to obtain their classification, and the received emails are used to update the classifier's training dataset in real time. A distinct training dataset is thus built for each user, so the classifier differs from user to user, which achieves personalized email filtering and classification.
The display module displays the results of email filtering and classification.
The present invention implements the above client system in Java and calls the library of the Waikato Environment for Knowledge Analysis (Weka) for classifier training and classification. The filter conditions set by the user include keyword filter conditions, sender address filter conditions, and the like.
The present invention also provides a client-based personalized email filtering method, divided into a training stage and a filtering stage. The method is based on immune concentration features and trains locally: each email a user receives updates the local dataset in real time, yielding a personalized training dataset for each user. This meets different users' personalized filtering requirements and addresses both the real-time and the personalization problems of email filtering. The method comprises the following steps:
1) In the training stage, perform the following steps:
11) For an existing email dataset, generate two detector sets according to the information content and tendency of words: a normal email detector set and a spam detector set;
12) For the existing email dataset, use the detector sets built in step 11) to construct immune concentration feature vectors, obtaining the immune concentration feature vector of each email in the dataset;
13) Use the immune concentration feature vectors obtained in step 12) to train a classifier, obtaining a trained classifier model;
2) In the filtering stage, perform the following steps:
21) Preprocess the received emails: parse each received email to obtain its title, body, recipient address, and sender address. Filter conditions set on the title, recipient address, and sender address (including title filter conditions, sender and recipient address filter conditions, and the like) are used to classify the email. Segment the body into words, so that each email is divided into multiple feature words;
22) Classify and filter the received emails by performing the following operations:
221) For each received email, use the detector sets built in step 11) to construct the corresponding immune concentration feature vector;
222) Use the classifier model of step 13) to classify the email, obtaining a classification result;
223) Filter the received email according to the classification result and the user-set filter conditions, obtaining a filtering result;
23) Update in real time according to user interaction and display the results, covering the following cases:
23a) When a received email is classified as spam, it enters the "spam folder";
23b) When a received email is classified as normal email, it enters the "inbox";
23c) When the user finds a normal email in the spam folder, or spam in the inbox, the user can manually reclassify the misclassified email. The reclassified email is segmented into words, and the method returns to step 1): the detector sets are updated with these words, and then the immune concentration feature vectors are rebuilt and the classifier is retrained in turn.
For the above filtering method, further, the information content and tendency of the words in step 11) are computed by a word screening method and a tendency calculation method, respectively. The word screening method is as follows:
For the existing email dataset, compute the information gain I(t) of every feature word by Formula 1, rank all feature words by I(t), and add the feature words ranked in the top m% to the gene library. In embodiments of the present invention, m is preferably 50.
In Formula 1, $P(C_i)$ is the frequency of documents of class $C_i$ in the dataset; $P(t)$ is the probability that a document in the dataset contains the feature word $t$; $P(\bar{t})$ is the probability that a document in the dataset does not contain $t$; $P(C_i \mid t)$ is the probability that a document belongs to class $C_i$ given that $t$ occurs in it; and $P(C_i \mid \bar{t})$ is the probability that a document belongs to class $C_i$ given that $t$ does not occur in it.
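Formula 1 is an image in the original publication and is not reproduced in this text. Assuming it is the standard information gain used for feature selection in text classification, which is consistent with the five quantities defined above, it would read as follows (the summation index $i$ ranging over the two classes, spam and normal email, is part of that assumption):

```latex
I(t) = -\sum_{i} P(C_i)\,\log P(C_i)
       + P(t)\sum_{i} P(C_i \mid t)\,\log P(C_i \mid t)
       + P(\bar{t})\sum_{i} P(C_i \mid \bar{t})\,\log P(C_i \mid \bar{t})
```

The first term is the entropy of the class distribution; the other two subtract the class entropy that remains once it is known whether $t$ occurs, so a larger $I(t)$ marks a more discriminative feature word.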
The tendency calculation is as follows: for each feature word in the gene library, compute the frequency with which it occurs in spam and the frequency with which it occurs in normal email. If the word occurs more frequently in normal email than in spam, add it to the normal email detector set; if it occurs more frequently in spam than in normal email, add it to the spam detector set; if the two frequencies are equal, add it to neither set. This generates the two detector sets.
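The tendency rule above can be sketched in a few lines of Python. This is only an illustration, not the patent's Java implementation: the function name `build_detector_sets` and the input layout are hypothetical, and "frequency" is assumed to mean the fraction of emails in a class that contain the word, which the text does not specify.

```python
from collections import Counter

def build_detector_sets(emails, gene_library):
    """emails: list of (feature_words, is_spam) pairs.
    Returns (spam_detectors, normal_detectors) as two sets of words."""
    spam_docs = [set(words) for words, is_spam in emails if is_spam]
    normal_docs = [set(words) for words, is_spam in emails if not is_spam]
    # Document frequency of every word within each class.
    spam_df = Counter(w for doc in spam_docs for w in doc)
    normal_df = Counter(w for doc in normal_docs for w in doc)

    spam_detectors, normal_detectors = set(), set()
    for word in gene_library:
        # Assumption: frequency = share of the class's emails containing the word.
        spam_freq = spam_df[word] / len(spam_docs) if spam_docs else 0.0
        normal_freq = normal_df[word] / len(normal_docs) if normal_docs else 0.0
        if spam_freq > normal_freq:
            spam_detectors.add(word)
        elif normal_freq > spam_freq:
            normal_detectors.add(word)
        # Equal frequencies: the word joins neither detector set.
    return spam_detectors, normal_detectors
```

A word that is equally common in both classes carries no tendency, so leaving it out keeps both detector sets informative.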
For the above filtering method, further, the immune concentration feature vector of step 12) is constructed as follows: for each email in the dataset, count how many of its distinct feature words appear in the spam detector set and how many appear in the normal email detector set. Let N be the number of distinct feature words in the email, S the number of those feature words that appear in the spam detector set, and L the number that appear in the normal email detector set. The resulting two-dimensional vector (S/N, L/N) is the immune concentration feature vector of the email; computing it for every email gives the feature vectors of the whole dataset.
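As a small worked sketch of the (S/N, L/N) construction, the Python function below computes the vector for one email. The name `concentration_vector` is illustrative, and returning (0.0, 0.0) for an email with no feature words is an assumption, since the text does not cover that case.

```python
def concentration_vector(feature_words, spam_detectors, normal_detectors):
    """Return the immune concentration feature vector (S/N, L/N) of one email."""
    distinct = set(feature_words)            # N counts distinct feature words only
    n = len(distinct)
    if n == 0:
        return (0.0, 0.0)                    # assumption: empty body maps to the origin
    s = len(distinct & spam_detectors)       # S: words found in the spam detector set
    l = len(distinct & normal_detectors)     # L: words found in the normal detector set
    return (s / n, l / n)
```

For example, an email with the distinct words free, win, report, and hello, checked against a spam detector set {free, win} and a normal detector set {report}, gives N = 4, S = 2, L = 1, and therefore the vector (0.5, 0.25).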
For the above filtering method, further, the classifier is a support vector machine (SVM).
For the above filtering method, further, during the training of step 13), a quadratic programming method is used to optimize the classifier's parameters.
Compared with the prior art, the beneficial effects of the present invention are as follows:
Existing spam filtering methods are mostly trained on existing datasets, which are seldom updated in real time from the emails received. This is because they filter on the server side, where the dataset can be updated only after the usage data of many users has been collected.
The spam filtering method provided by the present invention constructs an immune concentration feature vector for each email, which extracts email features effectively, improves classification performance, and improves spam filtering. Building on the good filtering performance of the immune concentration feature method, and because each user receives different emails, a personalized detection classifier is built locally for each user, yielding a personalized spam filtering client. The client system also includes other rule-based filtering methods, such as whitelists and keywords, which diversify the filtering and improve overall system performance. The client trains locally, and every email a user receives updates the training dataset. Because different users receive different emails, this real-time local update gives each user a distinct personalized training dataset and therefore personalized spam filtering, which addresses both the real-time and the personalization problems.
In summary, the technical solution provided by the present invention offers diverse filtering methods with good performance (spam filtering accuracy, recall, F-measure, and similar metrics can exceed 98%), and it also meets the requirements of real-time operation and personalization.
Brief description of the drawings
Fig. 1 is a flow diagram of the filtering method based on immune concentration features provided by the present invention.
Fig. 2 is a structural block diagram of the immune-concentration-based spam filtering client system implemented in an embodiment of the present invention.
Fig. 3 is a screenshot of the main interface of the client system after login in an embodiment of the present invention.
Fig. 4 is a screenshot of the email reading interface of the client system in an embodiment of the present invention.
Fig. 5 is a screenshot of the filtering settings interface of the client system in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below through embodiments with reference to the accompanying drawings, which do not limit the scope of the invention in any way.
The present invention provides a spam filtering method based on immune concentration features, proposes a new method for extracting immune concentration features, and applies the method to an email client system. The system supports simultaneous login of multiple accounts, reads user mail from the mail server, extracts the concentration features of each email, and uses a classifier to produce the classification result for that email. The method is divided into a training stage and a filtering stage. In the training stage, the training dataset is fed to the classifier, whose parameters are learned and optimized to obtain the best-performing classifier. In the filtering stage, the trained classifier is applied to the emails received by the client. The specific steps are:
S1) Use an existing email collection as the dataset, extract immune concentration feature vectors from it, and feed them to the classifier for training, generating a classifier model; the embodiment of the present invention uses an SVM as the classifier;
S2) After each user receives mail, parse each user's emails separately to obtain the title, body, and recipient and sender addresses of each email;
S3) Segment the body of the email into words, generate the immune concentration feature vector from the segmented body and the detector sets, and classify the email with the classifier model generated in S1.
The specific flow of the filtering method based on immune concentration features is shown in Fig. 1. Each received email is parsed to obtain its title, sender address, and body. The title, sender address, and similar parts are filtered by matching them against the user-set filter conditions, including keyword filtering and sender address filtering. The body is segmented into words, the immune concentration features are built from them, and the classifier computes a classification result. Finally, the user-set filter conditions and the classifier's result are combined to filter the mail in the client system.
Following this method, the embodiment below builds an immune-concentration-based spam filtering client system. The system is implemented in Java and calls the Weka library for classifier training and classification. Fig. 2 is a structural block diagram of this client system, which mainly comprises three modules: a receiving module, a filtering module, and a display module. The receiving module preprocesses the received emails and passes the results to the filtering module. The filtering module filters the user's received mail using filter conditions and the intelligent detection classification method, and updates the classifier in real time to achieve personalized classification. The display module displays the filtering results, and spam goes to the spam folder. The system is implemented in the following steps:
Step 1: Build the detector sets
A detector set is a collection of detectors. The present invention uses two kinds: a spam detector set and a normal email detector set. The tendency of each feature word toward the two classes of email is computed; feature words that tend to occur in spam are placed in the spam detector set, and feature words that tend to occur in normal email are placed in the normal email detector set.
When generating the detector sets, the main work is to combine the word screening algorithm with the tendency function, generating the two kinds of detector sets from the information content of the feature words (this embodiment uses information gain as the measure of information content; the calculation is described below) and their tendency. Specifically:
11) Word screening: for the existing email dataset, segment the body of each email to obtain its feature words. In this embodiment, each Chinese character is treated as one feature word, and each word is treated as one feature word; for example, the two-character Chinese word for "city" is split into two single-character feature words. After segmentation, each email has been divided into N feature words. Compute the information gain I(t) of every feature word by Formula 1 and rank all feature words by I(t). Feature words ranked in the top m% are added to the gene library; experiments show the best results at m = 50;
In Formula 1, $C_i$ is the class of an email (normal email or spam); $P(C_i)$ is the frequency of documents of class $C_i$ (normal email or spam) in the dataset; $P(t)$ is the probability that a document in the dataset contains the feature word $t$; $P(\bar{t})$ is the probability that a document in the dataset does not contain $t$; $P(C_i \mid t)$ is the probability that a document belongs to class $C_i$ given that $t$ occurs in it; and $P(C_i \mid \bar{t})$ is the probability that a document belongs to class $C_i$ given that $t$ does not occur in it.
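The word screening step can be illustrated with a short Python sketch. It assumes Formula 1 is the standard information gain over the two classes, uses base-2 logarithms (the base is not stated in the text and does not affect the ranking), and rounds the top m% up to a whole number of words; `information_gain` and `select_gene_library` are hypothetical names.

```python
import math

def information_gain(word, emails):
    """emails: list of (set_of_feature_words, label) pairs."""
    labels = [label for _, label in emails]
    with_t = [label for words, label in emails if word in words]
    without_t = [label for words, label in emails if word not in words]

    def plogp_sum(subset):
        # Sum over classes of P(C_i | subset) * log2 P(C_i | subset), with 0 * log 0 = 0.
        total = 0.0
        for c in set(labels):
            p = subset.count(c) / len(subset) if subset else 0.0
            if p > 0:
                total += p * math.log2(p)
        return total

    n = len(emails)
    return (-plogp_sum(labels)
            + len(with_t) / n * plogp_sum(with_t)
            + len(without_t) / n * plogp_sum(without_t))

def select_gene_library(emails, m=50):
    """Rank all feature words by information gain and keep the top m percent."""
    vocab = sorted({w for words, _ in emails for w in words})
    ranked = sorted(vocab, key=lambda w: information_gain(w, emails), reverse=True)
    keep = math.ceil(len(ranked) * m / 100)  # assumption: round the cut-off up
    return ranked[:keep]
```

A word that appears only in one class has the maximum gain, while a word spread evenly across both classes has a gain of zero and falls to the bottom of the ranking.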
12) Tendency calculation: for each feature word in the gene library, compute the frequency with which it occurs in each class of email (spam and normal email in this embodiment).
Feature words that occur more frequently in spam are added to the spam detector set $DS_S$; feature words that occur more frequently in normal email are added to the normal email detector set $DS_L$. The rationale is that a feature word occurring more frequently in spam should belong to the spam detector set, and a feature word occurring more frequently in normal email should belong to the normal email detector set.
Step 2: Build the immune concentration feature vectors
For the existing email dataset, count how many distinct feature words of each email appear in the spam detector set $DS_S$ and in the normal email detector set $DS_L$. Let N be the number of distinct feature words in an email, S the number of those that appear in the spam detector set, and L the number that appear in the normal email detector set. The immune concentration feature vector is then the two-dimensional vector (S/N, L/N).
Step 3: Train the classifier
The previous step converted each email into its immune concentration feature vector; these feature vectors are used to train the classifier. This embodiment uses a support vector machine (SVM) as the classifier. During training, a quadratic programming method is used to optimize the parameters of the classifier model.
Step 4: The client system preprocesses received emails
Fig. 4 is a screenshot of the email reading interface of the client system. As shown in Fig. 4, after receiving an email the client system parses it to obtain its title, body, and recipient and sender addresses. The title and the sender and recipient addresses can be matched against user-set filter conditions for condition-based filtering; the body, once segmented into words, is passed to the classifier model trained in the previous step.
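The preprocessing step can be sketched with Python's standard `email` package. This is an illustration rather than the patent's Java code: `parse_email` and `tokenize` are hypothetical helpers, the body is taken without decoding any transfer encoding, and "each word" is assumed to mean a run of Latin letters or digits alongside single Chinese characters.

```python
import re
from email import message_from_string

def parse_email(raw):
    """Split a raw message into the four fields used by the filter."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Simplification: take the first plain-text part, if any.
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        body = parts[0].get_payload() if parts else ""
    else:
        body = msg.get_payload()
    return {
        "title": msg.get("Subject", ""),
        "sender": msg.get("From", ""),
        "recipient": msg.get("To", ""),
        "body": body,
    }

def tokenize(text):
    """One feature word per Chinese character or per alphanumeric run (assumed rule)."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)
```

The title and addresses returned by `parse_email` feed the condition matching, while `tokenize(mail["body"])` produces the feature words from which the concentration vector is built.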
Step 5: The client system classifies and filters the emails
In the previous step, each email in the client system was divided into multiple feature words. The client system's filtering function is then enabled, as shown in Fig. 5. Using the detector sets built in Step 1 and the method of Step 2, each email is converted into its immune concentration feature vector and classified with the classifier model trained in Step 3. Finally, the emails in the client system are filtered according to the classification result and the filter conditions the user has set on the title, sender and recipient addresses, and so on (for example, whether the sender address is on the blacklist, or whether the title contains certain keywords). The result is displayed as shown in Fig. 3.
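The text says the classifier result and the user's conditions are combined but does not spell out their precedence. A minimal sketch, assuming a blacklist or keyword match always marks the email as spam and the classifier decides otherwise (`filter_email`, `blacklist`, and `blocked_keywords` are illustrative names, not taken from the patent):

```python
def filter_email(mail, classifier_says_spam, blacklist, blocked_keywords):
    """Return "spam" or "inbox" for one parsed email.

    mail: dict with "sender" and "title" keys.
    blacklist: set of lower-case sender addresses.
    blocked_keywords: keywords that must not appear in the title.
    """
    sender = mail["sender"].lower()
    title = mail["title"].lower()
    # Assumed precedence: user-set conditions are checked before the classifier.
    if sender in blacklist:
        return "spam"
    if any(keyword.lower() in title for keyword in blocked_keywords):
        return "spam"
    return "spam" if classifier_says_spam else "inbox"
```

A whitelist, which the description also mentions, could be handled the same way by returning "inbox" before the classifier is consulted.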
Step 6: Update in real time according to user interaction
When the filtering results of the previous step are displayed, emails classified as spam enter the "spam folder" and normal emails enter the "inbox". However, when the user finds a normal email in the spam folder, or spam in the inbox, the user can manually reclassify the misclassified email. These emails are then segmented into words, and the process returns to Step 1: the detector sets are updated with these words, and the immune concentration feature vectors are rebuilt and the classifier retrained in turn. The detector sets are updated as follows. For an email that the user manually marks as spam, all of its feature words that do not belong to the normal email detector set are added to the spam detector set. Likewise, for an email manually marked as normal email, all of its feature words that do not belong to the spam detector set are added to the normal email detector set.
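The update rule for a manually reclassified email translates directly into set operations. The sketch below modifies the two detector sets in place; the function name `update_detectors` is hypothetical, and rebuilding the feature vectors and retraining the classifier afterwards is left out.

```python
def update_detectors(feature_words, marked_as_spam, spam_detectors, normal_detectors):
    """Update the detector sets in place after the user relabels one email."""
    words = set(feature_words)
    if marked_as_spam:
        # Words not already in the normal email detector set join the spam set.
        spam_detectors.update(words - normal_detectors)
    else:
        # Words not already in the spam detector set join the normal email set.
        normal_detectors.update(words - spam_detectors)
```

Because a word is only added when it is absent from the opposite set, the two detector sets stay disjoint as long as they start out disjoint.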
It should be noted that the embodiments are disclosed to aid further understanding of the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to what the embodiments disclose; the scope of protection claimed is defined by the claims.
Claims (10)
1. A client-based personalized email filtering system, characterized in that it comprises a receiving module, a filtering and updating module, and a display module;
the receiving module is configured to receive emails, preprocess the received emails, and pass the preprocessing results to the filtering module;
the filtering and updating module comprises a database, a condition matcher, and an intelligent detection classifier; the database is a training dataset stored locally; the condition matcher is used by the user to set filter conditions, according to which the received emails are filtered; the intelligent detection classifier then detects and classifies the received emails to obtain their classification; at the same time, the received emails are used to update the local training dataset of the intelligent detection classifier in real time, thereby achieving personalized email filtering and classification;
the display module displays the results of email filtering and classification.
2. The personalized email filtering system of claim 1, characterized in that the system is implemented in Java, and the intelligent detection classifier is implemented by calling the Weka library.
3. The personalized email filtering system of claim 1, characterized in that the filter conditions set by the user include keyword filter conditions and sender address filter conditions.
4. A client-based personalized email filtering method, comprising a training stage and a filtering stage; in the training stage, the training dataset is fed to the classifier, and the classifier's parameters are learned and optimized to obtain an optimal classifier; in the filtering stage, the optimal classifier obtained by training is applied to the emails received by the client; the email filtering method is based on immune concentration features and, by updating the client's local dataset in real time, obtains a personalized training dataset for each user, meeting different users' personalized spam filtering requirements; the method specifically comprises the following steps:
1) In the training stage, perform the following steps:
11) For an existing email dataset, generate detector sets according to the information content and tendency of the segmented words; the detector sets include a normal email detector set and a spam detector set;
12) For the existing email dataset, use the detector sets built in step 11) to construct immune concentration feature vectors, obtaining the immune concentration feature vector of each email in the dataset;
13) Use the immune concentration feature vectors obtained in step 12) to train a classifier, obtaining a trained classifier model;
2) In the filtering stage, perform the following steps:
21) Preprocess the received emails: parse each received email to obtain its subject, body, recipient address, and sender address; filter conditions are set on the subject, recipient address, and sender address for use in classifying mail; the body is segmented into words, so that each email is divided into multiple feature words;
22) Classify and filter the received emails by performing the following operations:
221) For each received email, use the detector sets built in step 11) to reconstruct the email as its corresponding immune concentration feature vector;
222) Use the classifier model trained in step 13) to classify the email as either spam or normal email, thereby obtaining a classification result;
223) According to the classification result and the filter conditions set by the user, apply further filtering to the received emails to obtain the filtering result; the result is that each received email is classified as spam or normal email;
23) Update and display in real time according to user interaction.
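Steps 21) and 223) can be illustrated with a minimal Python sketch. It uses the standard-library `email` package to pull out the four fields, then combines a classifier verdict with user-set conditions. The claim does not say how the filter conditions and the classifier result are combined; the sender lists and the rule that explicit user conditions take precedence are assumptions made here for illustration, and the names `parse_mail` and `apply_filter` are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def parse_mail(raw: str) -> dict:
    """Step 21: extract subject, body, recipient and sender from a raw message."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Simplification: only the first part is treated as the body.
        body = msg.get_payload(0).get_payload()
    else:
        body = msg.get_payload()
    return {
        "subject": msg.get("Subject", ""),
        "sender": parseaddr(msg.get("From", ""))[1],
        "recipient": parseaddr(msg.get("To", ""))[1],
        "body": body,
    }


def apply_filter(mail: dict, classifier_says_spam: bool,
                 blocked_senders: set, trusted_senders: set) -> str:
    """Step 223: combine the classifier result with user-set filter conditions.

    Assumption: explicit user conditions override the classifier.
    """
    if mail["sender"] in trusted_senders:
        return "normal"
    if mail["sender"] in blocked_senders:
        return "spam"
    return "spam" if classifier_says_spam else "normal"


raw = "From: Alice <alice@example.com>\nTo: bob@example.com\nSubject: Hi\n\nHello Bob\n"
mail = parse_mail(raw)
verdict = apply_filter(mail, classifier_says_spam=True,
                       blocked_senders=set(),
                       trusted_senders={"alice@example.com"})
```

In this sketch a trusted sender is delivered as normal mail even when the classifier flags the message; a different precedence order would be equally compatible with the claim wording.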
5. The mail filtering method as claimed in claim 4, characterized in that the real-time update and display according to user interaction in step 23) covers the following cases:
23a) When a received email is classified as spam, the email goes to the "spam box";
23b) When a received email is classified as normal email, the email goes to the "inbox";
23c) When the user finds normal email in the "spam box", or spam in the "inbox", the user can manually reclassify the misclassified email; each reclassified email is segmented into words, and the method returns to the training stage of step 1), where those words are used to update the detector sets, after which the immune concentration feature vectors are rebuilt and the classifier is retrained in turn.
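The feedback loop of case 23c) can be sketched as follows. This is only an illustration under stated assumptions: the `Mailbox` class, its folder names, and the `retrain` callback are hypothetical, and the callback stands in for the whole training stage (updating the detector sets, rebuilding the feature vectors, retraining the classifier).

```python
from typing import Callable, Dict, List, Tuple

# A labelled corpus entry: (segmented words, label), label is "spam" or "normal".
Corpus = List[Tuple[List[str], str]]


class Mailbox:
    def __init__(self, retrain: Callable[[Corpus], None]):
        self.folders: Dict[str, Dict[int, List[str]]] = {"inbox": {}, "spam": {}}
        self.corpus: Corpus = []
        self.retrain = retrain  # hypothetical hook into the training stage

    def deliver(self, mail_id: int, words: List[str], label: str) -> None:
        """Cases 23a/23b: file the mail according to the classification result."""
        folder = "spam" if label == "spam" else "inbox"
        self.folders[folder][mail_id] = words

    def reclassify(self, mail_id: int) -> None:
        """Case 23c: the user moves a misfiled mail to the other folder."""
        src = "spam" if mail_id in self.folders["spam"] else "inbox"
        dst = "inbox" if src == "spam" else "spam"
        words = self.folders[src].pop(mail_id)
        self.folders[dst][mail_id] = words
        corrected_label = "spam" if dst == "spam" else "normal"
        self.corpus.append((words, corrected_label))
        self.retrain(self.corpus)  # re-enter the training stage


calls = []
box = Mailbox(retrain=lambda corpus: calls.append(len(corpus)))
box.deliver(1, ["cheap", "pills"], "normal")  # classifier missed this one
box.reclassify(1)                             # user marks it as spam
```

Retraining on every single correction is the simplest reading of the claim; batching several corrections before retraining would be a reasonable variation.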
6. The mail filtering method as claimed in claim 4, characterized in that in step 11) the detector sets are obtained according to the information content and tendency of the segmented words, which are calculated by a word screening method and a tendency calculation method, respectively;
The word screening method is specifically:
For the existing email data set, the information gain I(t) of every feature word is calculated by Formula 1; all feature words are then sorted by I(t) in descending order, and the feature words ranked in the top m% are added to the gene library;
$$I(t) = -\sum_{i} P(C_i)\log P(C_i) + P(t)\sum_{i} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$ (Formula 1)
In the above formula, $P(C_i)$ is the frequency of documents of class $C_i$ in the data set; $P(t)$ is the probability of documents in the data set that contain feature word $t$; $P(\bar{t})$ is the probability of documents in the data set that do not contain feature word $t$; $P(C_i \mid t)$ is the probability that a document belongs to class $C_i$ given that feature word $t$ occurs in it; $P(C_i \mid \bar{t})$ is the probability that a document belongs to class $C_i$ given that feature word $t$ does not occur in it;
The tendency calculation is specifically: for each feature word in the gene library, calculate the frequency with which the feature word occurs in spam and the frequency with which it occurs in normal email; when the frequency in normal email is greater than the frequency in spam, the feature word is added to the normal-email detector set; when the frequency in spam is greater than the frequency in normal email, the feature word is added to the spam detector set; two classes of detector sets are thereby generated.
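The word screening and tendency steps of claim 6 can be sketched in Python as below. Several choices are assumptions rather than anything the claim states: each email is represented as a set of words, "frequency" is taken to be the fraction of emails of a class that contain the word, the logarithm is base 2, both classes are assumed to be present in the data, and a word with equal frequency in both classes is left out of both detector sets. The function names are hypothetical.

```python
import math


def _plogp(p: float) -> float:
    """Return p * log2(p), using the convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def information_gain(docs, labels, term):
    """Information gain of one feature word; docs is a list of word sets."""
    n = len(docs)
    classes = sorted(set(labels))
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]
    gain = -sum(_plogp(labels.count(c) / n) for c in classes)
    for subset in (with_t, without_t):
        if subset:
            weight = len(subset) / n  # P(t) or P(not t)
            gain += weight * sum(_plogp(subset.count(c) / len(subset))
                                 for c in classes)
    return gain


def build_detector_sets(docs, labels, m=50):
    """Keep the top m% of words by information gain, then split by tendency."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    gene_library = ranked[: len(ranked) * m // 100]
    n_spam, n_normal = labels.count("spam"), labels.count("normal")
    spam_detectors, normal_detectors = set(), set()
    for t in gene_library:
        spam_freq = sum(1 for d, lab in zip(docs, labels)
                        if lab == "spam" and t in d) / n_spam
        normal_freq = sum(1 for d, lab in zip(docs, labels)
                          if lab == "normal" and t in d) / n_normal
        if spam_freq > normal_freq:
            spam_detectors.add(t)
        elif normal_freq > spam_freq:
            normal_detectors.add(t)
    return spam_detectors, normal_detectors


docs = [{"free", "hello"}, {"free", "hello"},
        {"meeting", "hello"}, {"meeting", "hello", "lunch"}]
labels = ["spam", "spam", "normal", "normal"]
spam_set, normal_set = build_detector_sets(docs, labels, m=50)
```

In this toy data set "hello" appears in every email and so carries no information gain, which is why it is screened out before the tendency step.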
7. The mail filtering method as claimed in claim 6, characterized in that the value of m is 50.
8. The mail filtering method as claimed in claim 4, characterized in that the immune concentration feature vector in step 12) is constructed as follows:
For each email in the email data set, count how many of its distinct feature words appear in the spam detector set and how many appear in the normal-email detector set;
Let N denote the number of distinct feature words in the email, S the number of those feature words that appear in the spam detector set, and L the number that appear in the normal-email detector set; a two-dimensional vector, denoted (S/N, L/N), is constructed and used as the immune concentration feature vector, thereby obtaining the immune concentration feature vector corresponding to each email in the email data set.
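The two-dimensional vector (S/N, L/N) described above takes only a few lines of Python. The function name and the handling of an email with no feature words (returning a zero vector) are assumptions; the claim does not cover that case.

```python
def concentration_vector(words, spam_detectors, normal_detectors):
    """Build (S/N, L/N) for one email from its segmented words."""
    distinct = set(words)
    n = len(distinct)  # N: number of distinct feature words
    if n == 0:
        return (0.0, 0.0)  # assumption: empty email maps to the zero vector
    s = len(distinct & spam_detectors)    # S: words found in the spam detector set
    l = len(distinct & normal_detectors)  # L: words found in the normal detector set
    return (s / n, l / n)


vec = concentration_vector(
    ["free", "free", "win", "meeting", "now"],
    spam_detectors={"free", "win"},
    normal_detectors={"meeting"},
)
```

Here the repeated word "free" is counted once, so N = 4, S = 2 and L = 1.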
9. The mail filtering method as claimed in claim 4, characterized in that the classifier is a support vector machine (SVM).
10. The mail filtering method as claimed in claim 4, characterized in that during the training in step 13), a quadratic programming method is used to optimize the parameters of the classifier.
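Claims 9 and 10 name a support vector machine optimized by quadratic programming but do not write out the optimization problem. For orientation only, the standard soft-margin SVM dual, which is a quadratic program, is given below; whether the patent uses exactly this form is an assumption. The symbols are not taken from the claims: $x_i$ is the immune concentration feature vector of the i-th training email, $y_i \in \{+1, -1\}$ its label (spam or normal), $\alpha_i$ the dual variables, $C$ the penalty parameter, $K$ the kernel function, and $b$ the bias term.

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0

f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right)
```

The second line is the resulting decision function: a new email with feature vector $x$ would be labelled spam or normal according to the sign of $f(x)$.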
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610316436.0A CN105871887B (en) | 2016-05-12 | 2016-05-12 | Client-based individual electronic mail filtering system and filter method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610316436.0A CN105871887B (en) | 2016-05-12 | 2016-05-12 | Client-based individual electronic mail filtering system and filter method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105871887A (en) | 2016-08-17 |
CN105871887B (en) | 2019-01-29 |
Family
ID=56631912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610316436.0A Expired - Fee Related CN105871887B (en) | 2016-05-12 | 2016-05-12 | Client-based individual electronic mail filtering system and filter method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105871887B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372237A (en) * | 2016-09-13 | 2017-02-01 | 新浪(上海)企业管理有限公司 | Fraudulent mail identification method and device |
CN106453423A (en) * | 2016-12-08 | 2017-02-22 | 黑龙江大学 | Spam filtering system and method based on user personalized setting |
CN108038189A (en) * | 2017-12-11 | 2018-05-15 | 南京茂毓通软件科技有限公司 | Email information extraction system |
CN109039863A (en) * | 2018-08-01 | 2018-12-18 | 北京明朝万达科技股份有限公司 | Self-learning-based mail security detection method, device and storage medium |
CN109831373A (en) * | 2019-03-01 | 2019-05-31 | 论客科技(广州)有限公司 | High-precision intelligent anti-misjudgment method and device for mail systems based on the FastText algorithm |
CN109918154A (en) * | 2017-12-07 | 2019-06-21 | 航天信息股份有限公司 | Method and system for pushing warning information in real time based on attribute association |
CN110268429A (en) * | 2017-02-10 | 2019-09-20 | 微软技术许可有限责任公司 | Automatic bundling of email content |
CN113343229A (en) * | 2021-06-30 | 2021-09-03 | 重庆广播电视大学重庆工商职业学院 | Network security protection system and method based on artificial intelligence |
CN113837154A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Open set filtering system and method based on multitask assistance |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059590A1 (en) * | 2006-09-05 | 2008-03-06 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method to filter electronic messages in a message processing system |
CN101316246A (en) * | 2008-07-18 | 2008-12-03 | 北京大学 | Junk mail detection method and system based on dynamic update of categorizer |
CN101330476A (en) * | 2008-07-02 | 2008-12-24 | 北京大学 | Method for dynamically detecting junk mail |
CN101594312A (en) * | 2008-05-30 | 2009-12-02 | 电子科技大学 | Spam recognition method and device based on artificial immunity and behavioral characteristics |
CN104156228A (en) * | 2014-04-01 | 2014-11-19 | 兰州工业学院 | Client-side short message filtration embedded feature library generating and updating method |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059590A1 (en) * | 2006-09-05 | 2008-03-06 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method to filter electronic messages in a message processing system |
CN101594312A (en) * | 2008-05-30 | 2009-12-02 | 电子科技大学 | Spam recognition method and device based on artificial immunity and behavioral characteristics |
CN101330476A (en) * | 2008-07-02 | 2008-12-24 | 北京大学 | Method for dynamically detecting junk mail |
CN101316246A (en) * | 2008-07-18 | 2008-12-03 | 北京大学 | Junk mail detection method and system based on dynamic update of categorizer |
CN104156228A (en) * | 2014-04-01 | 2014-11-19 | 兰州工业学院 | Client-side short message filtration embedded feature library generating and updating method |
Non-Patent Citations (1)
Title |
---|
谭营等: "反垃圾电子邮件方法研究进展", 《智能系统学报》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372237A (en) * | 2016-09-13 | 2017-02-01 | 新浪(上海)企业管理有限公司 | Fraudulent mail identification method and device |
CN106453423A (en) * | 2016-12-08 | 2017-02-22 | 黑龙江大学 | Spam filtering system and method based on user personalized setting |
CN106453423B (en) * | 2016-12-08 | 2019-10-01 | 黑龙江大学 | Spam filtering system and method based on user personalized setting |
CN110268429A (en) * | 2017-02-10 | 2019-09-20 | 微软技术许可有限责任公司 | Automatic bundling of email content |
CN109918154A (en) * | 2017-12-07 | 2019-06-21 | 航天信息股份有限公司 | Method and system for pushing warning information in real time based on attribute association |
CN108038189A (en) * | 2017-12-11 | 2018-05-15 | 南京茂毓通软件科技有限公司 | Email information extraction system |
CN109039863A (en) * | 2018-08-01 | 2018-12-18 | 北京明朝万达科技股份有限公司 | Self-learning-based mail security detection method, device and storage medium |
CN109039863B (en) * | 2018-08-01 | 2021-06-22 | 北京明朝万达科技股份有限公司 | Self-learning-based mail security detection method and device and storage medium |
CN109831373A (en) * | 2019-03-01 | 2019-05-31 | 论客科技(广州)有限公司 | High-precision intelligent anti-misjudgment method and device for mail systems based on the FastText algorithm |
CN113343229A (en) * | 2021-06-30 | 2021-09-03 | 重庆广播电视大学重庆工商职业学院 | Network security protection system and method based on artificial intelligence |
CN113837154A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Open set filtering system and method based on multitask assistance |
CN113837154B (en) * | 2021-11-25 | 2022-03-25 | 之江实验室 | Open set filtering system and method based on multitask assistance |
Also Published As
Publication number | Publication date |
---|---|
CN105871887B (en) | 2019-01-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105871887A (en) | Client-side based personalized E-mail filtering system and method | |
CN101674264B (en) | Spam detection device and method based on user relationship mining and credit evaluation | |
Toolan et al. | Feature selection for spam and phishing detection | |
CN106453033B | Multi-level email classification method based on email content | |
CN105447505B | Multi-level important email detection method | |
CN101699432B | Ordering strategy-based information filtering system | |
CN103812872B | Method and system for detecting internet water army behavior based on a mixed Dirichlet process | |
Katirai et al. | Filtering junk e-mail | |
CN101540017B | Feature extraction method based on byte-level n-grams, and spam filter | |
CN106296195A | Risk identification method and device | |
CN103136266A (en) | Method and device for classification of mail | |
CN102842078A (en) | Email forensic analyzing method based on community characteristics analysis | |
CN103886108A | Feature selection and weight calculation method for imbalanced text sets | |
CN102404249A (en) | Method and device for filtering junk emails based on coordinated training | |
CN104933475A (en) | Network forwarding behavior prediction method and apparatus | |
CN106156105A | Email aggregation and classification method and device | |
CN109299251A | Method and system for recognizing abnormal spam text messages based on a deep learning algorithm | |
Bhat et al. | Classification of email using BeaKS: Behavior and keyword stemming | |
CN101594314B (en) | Method for identifying image of junk e-mail based on high-order autocorrelation characteristic | |
CN105117466A (en) | Internet information screening system and method | |
Yeruva et al. | E-mail Spam Detection Using Machine Learning–KNN | |
Reddy et al. | Classification of Spam Messages using Random Forest Algorithm | |
Ergin et al. | Turkish anti-spam filtering using binary and probabilistic models | |
CN105337842B | Content-independent spam filtering method | |
CN101329668A | Method and apparatus for generating information rules, and method and system for judging information types |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190129 |