CN105447505A - Multilevel important email detection method - Google Patents

Info

Publication number
CN105447505A
Authority
CN
China
Prior art keywords
mail
email
address
important
insignificant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510752497.7A
Other languages
Chinese (zh)
Other versions
CN105447505B (en)
Inventor
不公告发明人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Shuzhilian Technology Co Ltd
Original Assignee
Chengdu Shuzhilian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Shuzhilian Technology Co Ltd filed Critical Chengdu Shuzhilian Technology Co Ltd
Priority to CN201510752497.7A priority Critical patent/CN105447505B/en
Publication of CN105447505A publication Critical patent/CN105447505A/en
Application granted granted Critical
Publication of CN105447505B publication Critical patent/CN105447505B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a multilevel important email detection method built on information such as the email address, the email subject, and the email body. First, an email-address-based secondary feature of an email is extracted using a Bayesian method. Second, an email-subject-based secondary feature is extracted using latent Dirichlet allocation (LDA) and a support vector machine (SVM). Third, an email-body-based secondary feature is extracted using the C4.5 decision tree algorithm and an SVM. Finally, a neural network model is trained on these three secondary features, and the trained model is used to detect whether an email is important. The method thereby achieves higher detection accuracy and recall.

Description

A multilevel important email detection method
Technical field
The invention belongs to the technical field of email detection. More specifically, it relates to a multilevel method for detecting important emails, applicable to tasks such as important email detection and spam filtering.
Background
With the rapid development of Internet technology, communication over the Internet has become increasingly frequent, and email has become an indispensable part of daily life, work, and study. However, as email has become an indispensable communication medium, it has also become a commercial tool, so users must spend a great deal of time picking out the important emails they need from the large volume they receive. Some email detection algorithms exist to address this problem, but each relies on a single method, so their results are not accurate enough and struggle to meet application requirements, especially when important emails make up a small share of all email. Improving the accuracy of important email detection, particularly when important emails are a small share, is therefore an active research problem.
Existing solutions include probability-based methods, methods based on statistical learning, and methods based on similarity clustering. Probability-based methods, such as the classical Bayes method, compute the conditional probability of each class given a set of attribute values and take the class label with the largest conditional probability as the classification result. Their drawback is that the method's preconditions are generally not satisfied. Statistical learning methods include SVM and decision trees. SVM is currently one of the better email classification methods: it maps email attributes into a high-dimensional space through a kernel function, constructs a maximum-margin hyperplane in that space, and decides the class of an email according to which side of the hyperplane the email falls on. Its drawback is that the choice of kernel function is somewhat arbitrary and lacks effective guidance, so it is hard to select the best kernel for a particular problem. The decision tree is a relatively efficient method: it first discretizes the attribute values and then builds a tree on the discretized values, splitting step by step until a branch meets a predetermined requirement or contains a single email. Its drawback is that it overfits easily. Methods based on similarity clustering, such as KNN, compute the distances between emails and assign an email to the class of the samples it is closest to. Their drawback is that distances between emails must be computed, which makes classification inefficient.
Each of these methods has its own strengths and weaknesses. When high accuracy is required and the ratio of important to unimportant emails is highly imbalanced, they cannot meet the requirements of practical applications.
Summary of the invention
To address the deficiencies and defects of the prior art, the invention provides a multilevel email detection method built on information such as the email addresses, the email subject, and the email body. The method builds a separate secondary feature extraction model for each of the email addresses, subject, and body, uses these models to obtain secondary features, and then trains a neural network model with the secondary features as input. The invention combines the Bayes, LDA (latent Dirichlet allocation), SVM (support vector machine), and decision tree methods and achieves good results in detecting important emails.
The specific steps of the invention are as follows:
(1) Email preprocessing
From the collected emails, a total of N important and unimportant emails are randomly drawn in a given proportion, and each is labeled "important email" or "unimportant email" according to its actual importance.
(2) For each extracted email, three parts of information are extracted by regular expression matching or string matching: the email addresses, the email subject, and the email body.
(3) Extracting the secondary feature based on email address pairs
(3.1) Let $A_i$ denote the set of sender and recipient addresses of the i-th email. The set of all addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_N$. Let $freq^{+}(a_h, a_l)$ denote the number of times the address pair $(a_h, a_l)$ occurs among the address pairs of emails labeled important, and $freq^{-}(a_h, a_l)$ the number of times it occurs among the address pairs of emails labeled unimportant, where $a_h, a_l \in A$ and both addresses come from the same email. The ratios $p^{+}(a_h, a_l)$ and $p^{-}(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs in the set of important-email address pairs and in the set of unimportant-email address pairs are:
$$p^{+}(a_h, a_l) = \frac{freq^{+}(a_h, a_l)}{N}$$
$$p^{-}(a_h, a_l) = \frac{freq^{-}(a_h, a_l)}{N}$$
(3.2) Let $A_i^{+}$ denote the part of the address pairs formed from the i-th email that is contained in the set of important-email address pairs, and let $A_i^{-}$ denote the part that is contained in the set of unimportant-email address pairs. The secondary feature $f_{i,1}$ of the i-th email based on address pairs is then computed as:
$$f_{i,1} = \frac{\sqrt[|A_i^{+}|]{\prod_{(a_h, a_l) \in A_i^{+}} p^{+}(a_h, a_l)}}{\sqrt[|A_i^{+}|]{\prod_{(a_h, a_l) \in A_i^{+}} p^{+}(a_h, a_l)} + \sqrt[|A_i^{-}|]{\prod_{(a_h, a_l) \in A_i^{-}} p^{-}(a_h, a_l)}}$$
where $|A_i^{+}|$ is the number of address pairs of the i-th email contained in the set of important-email address pairs, and $|A_i^{-}|$ is the number of address pairs of the i-th email contained in the set of unimportant-email address pairs.
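As an illustration of step (3), the following Python sketch computes the pair ratios and the feature $f_{i,1}$. It is a minimal sketch under stated assumptions, not the patent's implementation: address pairs are treated as unordered, the $|A_i^{\pm}|$ terms are read as root indices (geometric means), and the function names and the fallback value `default` for emails with no known pair are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import prod


def address_pairs(addresses):
    """Unordered pairs of distinct addresses taken from one email (assumption)."""
    return {tuple(sorted(pair)) for pair in combinations(set(addresses), 2)}


def pair_ratios(emails, labels):
    """Return (p_plus, p_minus): pair counts per class divided by N."""
    n = len(emails)
    freq_plus, freq_minus = Counter(), Counter()
    for addresses, label in zip(emails, labels):
        target = freq_plus if label == 1 else freq_minus
        target.update(address_pairs(addresses))
    p_plus = {pair: count / n for pair, count in freq_plus.items()}
    p_minus = {pair: count / n for pair, count in freq_minus.items()}
    return p_plus, p_minus


def geometric_mean(values):
    """k-th root of the product of k values; 0.0 for an empty list."""
    return prod(values) ** (1.0 / len(values)) if values else 0.0


def address_feature(addresses, p_plus, p_minus, default=0.5):
    """f_{i,1}: share of the 'important' geometric mean in the sum of both."""
    pairs = address_pairs(addresses)
    g_plus = geometric_mean([p_plus[p] for p in pairs if p in p_plus])
    g_minus = geometric_mean([p_minus[p] for p in pairs if p in p_minus])
    total = g_plus + g_minus
    # 'default' is a hypothetical fallback when no pair was seen in training.
    return g_plus / total if total > 0 else default
```

A feature close to 1 means the email's address pairs were seen mostly in important emails; a value close to 0 means they were seen mostly in unimportant ones.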
(4) Extracting the secondary feature based on the email subject
(4.1) A Chinese word segmentation system is used to segment the email subject. Nouns, verbs, adjectives, and adverbs are chosen from the resulting tokens as feature words, giving F feature words.
(4.2) Using the F feature words obtained in step (4.1), the frequency with which each feature word occurs in the i-th email is counted and vectorized, giving N F-dimensional vectors $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,F})$, $1 \le i \le N$. These N vectors form the training matrix $(TM)_{N \times F}$. First, $(TM)_{N \times F}$ is used as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the latent topic information of the emails. The output of the topic model gives N T-dimensional vectors $X'_i = (x'_{i,1}, x'_{i,2}, \dots, x'_{i,T})$, which form the output matrix $(TM\_SVM)_{N \times T}$, where T is a number of topics given in advance. Then, with $(TM\_SVM)_{N \times T}$ as input and the email labels as targets, the SVM (support vector machine) algorithm is used to train a classification model based on the email subject. The output of this model gives the probability that the i-th email is important, and this probability is taken as the email's secondary feature based on the email subject, denoted $f_{i,2}$.
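The vectorization at the start of step (4.2) can be sketched in a few lines of Python. This is only an illustration: it assumes the subjects have already been segmented into token lists, and the function name `term_frequency_matrix` is hypothetical. The resulting matrix corresponds to $(TM)_{N \times F}$ and would then be passed to an LDA implementation and an SVM with probability outputs (for instance scikit-learn's `LatentDirichletAllocation` and `SVC(probability=True)`, which is one possible choice and not something the patent specifies).

```python
from collections import Counter


def term_frequency_matrix(tokenized_subjects, feature_words):
    """Build the N x F matrix of feature-word counts, one row per email."""
    wanted = set(feature_words)
    matrix = []
    for tokens in tokenized_subjects:
        counts = Counter(token for token in tokens if token in wanted)
        # Column order follows the order of feature_words.
        matrix.append([counts.get(word, 0) for word in feature_words])
    return matrix
```

Tokens that are not feature words are simply ignored, so every row has exactly F entries.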
(5) Extracting the secondary feature based on the email body
(5.1) Email body preprocessing
A Chinese word segmentation system is used to segment the email body. Nouns and verbs are chosen from the tokens by part of speech as candidate feature words, giving the candidate feature word set of the training emails. The chi-square value of each candidate feature word is then calculated according to the following formula:
$$\chi^{2}(t, c) = \frac{N \times (AD - CB)^{2}}{(A + C) \times (B + D) \times (A + B) \times (C + D)}$$
where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of times t occurs in emails of class c, B is the number of times t occurs in emails not of class c, C is the number of emails of class c in which t does not occur, D is the number of emails not of class c in which t does not occur, and N is the size of the training set. The G candidate feature words with the largest chi-square values are taken as the feature words for subsequent processing. This filters out feature words that contribute little to classification and reduces the computational complexity.
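The chi-square selection of step (5.1) can be sketched as follows. This is a minimal sketch under an assumption: A, B, C, and D are all counted as numbers of emails (so that A + B + C + D equals the training set size N), which is the usual form of the statistic. The function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic from the 2x2 contingency counts A, B, C, D."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denominator if denominator else 0.0


def contingency(term, docs, labels, cls=1):
    """Count emails with/without the term, inside and outside class cls."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if label == cls:
            if present:
                a += 1
            else:
                c += 1
        elif present:
            b += 1
        else:
            d += 1
    return a, b, c, d


def top_g_terms(docs, labels, g, cls=1):
    """Return the g terms with the largest chi-square values."""
    vocabulary = sorted({token for tokens in docs for token in tokens})
    scored = [(chi_square(*contingency(t, docs, labels, cls)), t) for t in vocabulary]
    # Sort by descending score; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:g]]
```

With two classes the statistic is the same whichever class is taken as c, so a single pass with `cls=1` is enough.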
(5.2) Preliminary email classification
Using the G feature words obtained in (5.1), the tf-idf values of the feature words in the i-th email are computed and vectorized, giving new vectors $Y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,G})$, $1 \le i \le N$. The vectors $Y_i$ are used as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than a threshold α, the node is judged to be an unimportant-email node; controlling α at each leaf node keeps the overall recall of important emails at a high level. This trains a preliminary classification model that can filter out part of the unimportant emails, and the model divides the emails into two classes, important and unimportant.
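The leaf-labeling rule of step (5.2) can be illustrated independently of any particular decision tree library. The sketch below assumes the tree has already been grown and that, for each leaf, the number of important training emails and the total number of training emails are known; the function names and the dictionary layout are hypothetical.

```python
def label_leaves(leaf_counts, alpha):
    """Label a leaf 0 (unimportant) if its share of important emails is below alpha.

    leaf_counts maps a leaf id to (important_count, total_count).
    """
    return {
        leaf: (0 if important / total < alpha else 1)
        for leaf, (important, total) in leaf_counts.items()
        if total > 0
    }


def important_recall(leaf_counts, leaf_labels):
    """Fraction of important training emails that land in leaves labeled 1."""
    kept = sum(
        important
        for leaf, (important, _) in leaf_counts.items()
        if leaf_labels.get(leaf) == 1
    )
    total = sum(important for important, _ in leaf_counts.values())
    return kept / total if total else 0.0
```

Lowering α labels more leaves as important, which raises the recall of important emails at the cost of filtering out fewer unimportant ones.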
(5.3) Secondary feature extraction
The preliminary classification model of step (5.2) divides the emails into important and unimportant emails. For an email judged important, the Bayesian probabilities that each of the G feature words obtained in step (5.1) belongs to important emails and to unimportant emails are calculated, and the ratio of the probability of belonging to important emails to the probability of belonging to unimportant emails is taken as the feature value of the corresponding feature word and vectorized. For an email judged unimportant, the feature values of all G feature words are set directly to 0 and vectorized in the same way. This gives new vectors $Z_i = (z_{i,1}, z_{i,2}, \dots, z_{i,G})$, $1 \le i \le N$. The vectors $Z_i$ are used as the input of the SVM algorithm, and a classification model based on the email body is built with the true class labels of the emails as targets. The output of this model gives the probability that the i-th email is important, and this probability is taken as the email's secondary feature based on the email body, denoted $f_{i,3}$.
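One possible reading of the feature values in step (5.3) is sketched below. The text does not spell out how the Bayesian probabilities are estimated, so this sketch makes several assumptions that are not from the source: the probabilities are estimated as smoothed per-class email frequencies of each word, the ratio is only assigned to words that actually occur in the email, and all function and parameter names (`word_ratios`, `body_vector`, `smoothing`) are hypothetical.

```python
def word_ratios(docs, labels, feature_words, smoothing=1.0):
    """Ratio P(word | important) / P(word | unimportant) with add-one style smoothing."""
    n_important = sum(1 for label in labels if label == 1)
    n_unimportant = len(labels) - n_important
    ratios = {}
    for word in feature_words:
        in_important = sum(1 for d, l in zip(docs, labels) if l == 1 and word in d)
        in_unimportant = sum(1 for d, l in zip(docs, labels) if l == 0 and word in d)
        p_important = (in_important + smoothing) / (n_important + 2 * smoothing)
        p_unimportant = (in_unimportant + smoothing) / (n_unimportant + 2 * smoothing)
        ratios[word] = p_important / p_unimportant
    return ratios


def body_vector(tokens, preliminary_label, feature_words, ratios):
    """Vector Z_i: all zeros if the preliminary model judged the email unimportant."""
    if preliminary_label == 0:
        return [0.0] * len(feature_words)
    present = set(tokens)
    # Assumption: words absent from the email get a feature value of 0.
    return [ratios[word] if word in present else 0.0 for word in feature_words]
```

The vectors produced this way would then be fed to an SVM trained on the true labels, as the step describes.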
(6) Modeling with the secondary features
The secondary features $f_{i,1}$, $f_{i,2}$, $f_{i,3}$ obtained in steps (3), (4), and (5) form a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$. With this vector as input, a neural network is trained that has a single hidden layer of two nodes and an output layer of one node; the output of the network lies in the interval [0, 1]. If the output value is greater than a threshold θ, the email is an important email; otherwise it is an unimportant email.
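The network of step (6) is small enough to write out by hand. The sketch below shows only the forward pass and the threshold decision for a 3-2-1 network. The sigmoid activation is an assumption (the text only states the output range), the weights would come from training, and the names `forward` and `is_important` are hypothetical.

```python
from math import exp


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def forward(v, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: 3 inputs -> 2 hidden nodes -> 1 output node."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, v)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def is_important(v, params, theta):
    """Classify as important when the network output exceeds theta."""
    return forward(v, *params) > theta
```

Here `v` is the vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$ and `params` is the tuple `(w_hidden, b_hidden, w_out, b_out)`, with `w_hidden` holding one row of three weights per hidden node.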
(7) Important email detection
To classify a new email, steps (3), (4), and (5) are applied to obtain its secondary features based on the email addresses, the email subject, and the email body, and the neural network classification model built in step (6) is then used to classify the email.
Brief description of the drawings
Fig. 1 is a flowchart of the multilevel email detection method of the invention, built on information such as the email addresses, the email subject, and the email body.
Detailed description of the embodiment
Specific embodiments of the invention are described below with reference to the accompanying drawing so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the main content of the invention.
The emails used in this embodiment include notification emails of international academic conferences, international trade correspondence, correspondence between enterprises, advertising emails, and phishing emails spread over the network. Based on actual conditions, the first three types are treated as important emails and the last two types as unimportant emails.
In this embodiment, the email classification method comprises the following steps:
(1) Email preprocessing
From the collected emails, $N_1$ ($N_1 = 700$) important emails and $N_2$ ($N_2 = 7000$) unimportant emails are randomly drawn, for a total of N (N = 7700) emails used as training text. Each of the N emails is labeled "important email" or "unimportant email" (here important emails are labeled 1 and unimportant emails are labeled 0). This step corresponds to step ST1 in Fig. 1.
(2) For each email, the email addresses, the email subject, and the email body are extracted by regular expression matching or string matching. The matching expression for email addresses is:
Reg_add = "/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/";
The subject is extracted according to the subject marker and the content marker that appear in the email. The body is what remains after the matched address information and subject information are removed. This step is step ST2 in Fig. 1.
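The extraction in step (2) can be sketched with Python's `re` module. This is an illustrative sketch, not the embodiment's code: it assumes the raw email has `From:`, `To:`, and `Subject:` header lines followed by a blank line and then the body, and it uses a simplified, unanchored address pattern (`ADDRESS_RE`) rather than the expression given above, so that addresses can be found inside header lines. All names are hypothetical.

```python
import re

# Simplified address pattern for searching inside text (assumption, see lead-in).
ADDRESS_RE = re.compile(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+")
SUBJECT_RE = re.compile(r"^Subject:[ \t]*(.*)$", re.MULTILINE)


def split_email(raw):
    """Return (addresses, subject, body) from a raw email string."""
    # Headers and body are assumed to be separated by the first blank line.
    header, _, body = raw.partition("\n\n")
    addresses = sorted(set(ADDRESS_RE.findall(header)))
    match = SUBJECT_RE.search(header)
    subject = match.group(1).strip() if match else ""
    return addresses, subject, body.strip()
```

The returned address list is the set $A_i$ used in step (3), and the subject and body feed steps (4) and (5).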
(3) Extracting the secondary feature based on email address pairs
(3.1) Let $A_i$ denote the set of sender and recipient addresses of the i-th email. The set of all addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_{7700}$. Compute $freq^{+}(a_h, a_l)$ and $freq^{-}(a_h, a_l)$, where $a_h, a_l \in A$ and both addresses come from the same email. The ratios $p^{+}(a_h, a_l)$ and $p^{-}(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs in the set of important-email address pairs and in the set of unimportant-email address pairs are:
$$p^{+}(a_h, a_l) = \frac{freq^{+}(a_h, a_l)}{N}$$
$$p^{-}(a_h, a_l) = \frac{freq^{-}(a_h, a_l)}{N}$$
(3.2) Let $A_i^{+}$ denote the part of the address pairs formed from the i-th email that is contained in the set of important-email address pairs, and let $A_i^{-}$ denote the part that is contained in the set of unimportant-email address pairs, $1 \le i \le 7700$. The secondary feature of the i-th email based on email addresses is calculated with the following formula:
$$f_{i,1} = \frac{\sqrt[|A_i^{+}|]{\prod_{(a_h, a_l) \in A_i^{+}} p^{+}(a_h, a_l)}}{\sqrt[|A_i^{+}|]{\prod_{(a_h, a_l) \in A_i^{+}} p^{+}(a_h, a_l)} + \sqrt[|A_i^{-}|]{\prod_{(a_h, a_l) \in A_i^{-}} p^{-}(a_h, a_l)}}$$
where $|A_i^{+}|$ is the number of address pairs of the i-th email contained in the set of important-email address pairs, and $|A_i^{-}|$ is the number of address pairs of the i-th email contained in the set of unimportant-email address pairs. This step is step ST4 in Fig. 1.
(4) Extracting the secondary feature based on the email subject
(4.1) A Chinese word segmentation system is used to segment the subject of each email. Nouns, verbs, adjectives, and adverbs are chosen from the tokens as feature words, giving F = 205 feature words. This step is step ST5 in Fig. 1.
(4.2) Using the F feature words obtained in step (4.1), the frequency with which each feature word occurs in each email is counted and vectorized, giving N F-dimensional vectors $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,205})$, $1 \le i \le 7700$, which form the training matrix $(TM)_{7700 \times 205}$. The matrix $(TM)_{7700 \times 205}$ is used as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model with T = 12 topics, which identifies the latent topic information of the emails. The output of the topic model gives N 12-dimensional vectors $X'_i = (x'_{i,1}, x'_{i,2}, \dots, x'_{i,12})$, which form the output matrix $(TM\_SVM)_{7700 \times 12}$. Then, with $(TM\_SVM)_{7700 \times 12}$ as input and a Gaussian kernel function, the SVM (support vector machine) algorithm is used to train the secondary feature extraction model based on the email subject. This model extracts the secondary feature based on the email subject (the probability that the email is important), denoted $f_{i,2}$. This step is step ST6 in Fig. 1.
(5) Extracting the secondary feature based on the email body
(5.1) Email body preprocessing
A Chinese word segmentation system is used to segment the email body. Nouns and verbs are chosen from the tokens as candidate feature words, giving the candidate feature word set of all training emails. The chi-square value of each candidate feature word is then calculated according to the following formula:
$$\chi^{2}(t, c) = \frac{N \times (AD - CB)^{2}}{(A + C) \times (B + D) \times (A + B) \times (C + D)}$$
where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of times t occurs in emails of class c, B is the number of times t occurs in emails not of class c, C is the number of emails of class c in which t does not occur, D is the number of emails not of class c in which t does not occur, and N is the number of emails in the whole training set (N = 7700). The G = 230 candidate feature words with the largest chi-square values are taken as feature words. This filters out feature words that contribute little to classification and reduces the computational complexity. This step is step ST7 in Fig. 1.
(5.2) Email filtering
Using the G = 230 feature words obtained in (5.1), the tf-idf values of the feature words in each email are computed and vectorized, giving new vectors $Y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,230})$, $1 \le i \le 7700$. The vectors $Y_i$ are used as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than the threshold α, the node is judged to be an unimportant-email node; setting α = 0.03 at each leaf node keeps the overall recall of important emails at a high level. The resulting preliminary classification model divides the emails into two classes, important and unimportant. This step is step ST8 in Fig. 1.
(5.3) Secondary feature extraction
The preliminary classification model of step (5.2) divides the emails into important and unimportant emails. For an email judged important, the Bayesian probabilities that each of the G feature words obtained in step (5.1) belongs to important emails and to unimportant emails are calculated, and the ratio of the probability of belonging to important emails to the probability of belonging to unimportant emails is taken as the feature value of the corresponding feature word and vectorized. For an email judged unimportant, the feature values of all G feature words are set directly to 0 and vectorized in the same way. This gives new vectors $Z_i = (z_{i,1}, z_{i,2}, \dots, z_{i,230})$, $1 \le i \le 7700$. The vectors $Z_i$ are used as the input of the SVM algorithm, and a classification model based on the email body is built with the true class labels of the emails as targets. The output of this model gives the probability that the i-th email is important, and this probability is taken as the email's secondary feature based on the email body, denoted $f_{i,3}$. This step is step ST9 in Fig. 1.
(6) Modeling with the secondary features
The features $f_{i,1}$, $f_{i,2}$, $f_{i,3}$ obtained in steps (3), (4), and (5) form a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$, $1 \le i \le 7700$. With this vector as input, a neural network is trained that has a single hidden layer of two nodes and an output layer of one node; the output value of the network lies in the interval [0, 1]. When the output value is greater than θ, the email is judged to be important; otherwise it is judged to be unimportant. Based on analysis of the training results, θ is set to 0.53. This step is step ST10 in Fig. 1.
(7) Email prediction
To verify the effect of the multilevel important email detection method, cross-validation was used: from the 7700 emails processed above, 80% were randomly drawn as the training set and the remaining 20% were used as the validation set. This was repeated 100 times. The average accuracy was 86.2% and the average recall was 90.3%.
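The validation procedure above (repeated random 80/20 splits with averaged scores) can be sketched as follows. This is a hedged illustration: it assumes the reported accuracy corresponds to precision on the important class, which the text does not confirm, and `train_and_predict` is a hypothetical callable standing in for the whole pipeline of steps (3) to (6).

```python
import random


def precision_recall(y_true, y_pred):
    """Precision and recall for the important class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def repeated_holdout(samples, labels, train_and_predict,
                     rounds=100, train_fraction=0.8, seed=0):
    """Average precision and recall over repeated random train/validation splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    cut = int(len(indices) * train_fraction)
    total_precision = total_recall = 0.0
    for _ in range(rounds):
        rng.shuffle(indices)
        train, valid = indices[:cut], indices[cut:]
        predictions = train_and_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in valid],
        )
        p, r = precision_recall([labels[i] for i in valid], predictions)
        total_precision += p
        total_recall += r
    return total_precision / rounds, total_recall / rounds
```

`train_and_predict(train_samples, train_labels, valid_samples)` is expected to return one predicted label (0 or 1) per validation sample.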
These results show that, when the ratio of important to unimportant emails is imbalanced, the method improves on other email classification models by 15% to 20%. This indicates that the invention has good application value in fields such as important email identification.
Although illustrative embodiments of the invention have been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all innovations that make use of the inventive concept are protected.

Claims (6)

1. A multilevel important email detection method, applicable to specific applications such as important email detection and spam filtering and having high accuracy and recall, characterized in that it comprises the following steps:
(1) Email preprocessing
From the collected emails, N emails are randomly drawn, and each is labeled "important email" or "unimportant email" according to its actual importance;
(2) For each extracted email, three parts of information are extracted by regular expression matching or string matching: the email addresses, the email subject, and the email body;
(3) Extracting the secondary feature based on email addresses
(3.1) Let $A_i$ denote the set of sender and recipient addresses of the i-th email; the set of all addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_N$; let $freq^{+}(a_h, a_l)$ denote the total number of times the address pair $(a_h, a_l)$ occurs in the address sets formed from important emails, and $freq^{-}(a_h, a_l)$ the total number of times it occurs in the address sets formed from unimportant emails, where $a_h, a_l \in A$ and the addresses $a_h, a_l$ come from the same email; the ratios $p^{+}(a_h, a_l)$ and $p^{-}(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs among important-email address pairs and among unimportant-email address pairs are taken as the secondary features based on email addresses, where $p^{+}(a_h, a_l) = \frac{freq^{+}(a_h, a_l)}{N}$ and $p^{-}(a_h, a_l) = \frac{freq^{-}(a_h, a_l)}{N}$;
(3.2) Let $A_i^{+}$ denote the part of the address pairs formed from the i-th email that is contained in the set of important-email address pairs, and let $A_i^{-}$ denote the part that is contained in the set of unimportant-email address pairs; the secondary feature $f_{i,1}$ of the i-th email based on address pairs is then computed as:
$$f_{i,1} = \frac{\sqrt[|A_i^{+}|]{\prod_{(a_h, a_l) \in A_i^{+}} p^{+}(a_h, a_l)}}{\sqrt[|A_i^{+}|]{\prod_{(a_h, a_l) \in A_i^{+}} p^{+}(a_h, a_l)} + \sqrt[|A_i^{-}|]{\prod_{(a_h, a_l) \in A_i^{-}} p^{-}(a_h, a_l)}}$$
where $|A_i^{+}|$ is the number of address pairs of the i-th email contained in the set of important-email address pairs, and $|A_i^{-}|$ is the number of address pairs of the i-th email contained in the set of unimportant-email address pairs;
(4) Extracting the secondary feature based on the email subject
(4.1) A Chinese word segmentation system is used to segment the email subject, and nouns, verbs, adjectives, and adverbs are chosen from the token set as feature words, giving F feature words; the frequency with which each of the F feature words occurs in each email is counted and vectorized; the resulting N F-dimensional vectors form the training matrix $(TM)_{N \times F}$, which is used as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the hidden topic information of the emails; the N T-dimensional vectors (T being the number of topics) obtained as the output of the trained LDA topic model are used as the input of the SVM, with the email class labels as targets, and the SVM (support vector machine) algorithm is used to train the secondary feature extraction model based on the email subject; this model extracts the secondary feature of the i-th email, denoted $f_{i,2}$;
(5) Extracting the secondary feature based on the email body
A Chinese word segmentation system is used to segment the email body, the chi-square value of each token is calculated, and the G tokens with the largest chi-square values are chosen as feature words; the tf-idf values of the G feature words in each email are computed and vectorized, and the resulting vectors are used as the input of the C4.5 decision tree algorithm; if the proportion of important emails in a leaf node is less than a threshold α, the node is judged to be an unimportant-email node, and α is controlled at each leaf node to keep the overall recall of important emails at a high level; a filtering model that can filter out part of the unimportant emails is thereby trained; the preliminary classification model built with the C4.5 algorithm divides the emails into important and unimportant emails; for an email judged important, the Bayesian probabilities that each of the G feature words belongs to important emails and to unimportant emails are calculated, and the ratio of the probability of belonging to important emails to the probability of belonging to unimportant emails is taken as the feature value of the corresponding feature word and vectorized; for an email judged unimportant, the feature values of all G feature words are set to 0 and vectorized; the resulting vectors are used as the input of the SVM algorithm, and the secondary feature extraction model based on the email body is built with the true class labels of the emails as targets; this model extracts the secondary feature of the i-th email based on the email body, denoted $f_{i,3}$;
(6) Modeling with the secondary features
The features $f_{i,1}$, $f_{i,2}$, $f_{i,3}$ obtained in steps (3), (4), and (5) form a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$; with this vector as the input of a neural network algorithm, a neural network is trained that has a single hidden layer of two nodes and an output layer of one node, and whether an email is important is judged from the magnitude of the output value of the output layer.
2. The multilevel important email detection method according to claim 1, characterized in that in step (2) the email is divided into three parts of information, namely the email address, the email subject and the email body, and each part is then processed separately by a different method to obtain the corresponding secondary feature.
3. The multilevel important email detection method according to claim 1, characterized in that in step (3) statistical methods are applied to the email address information to obtain the secondary feature based on the email address.
4. The multilevel important email detection method according to claim 1, characterized in that in step (4) LDA and SVM are used to obtain the secondary feature based on the email subject.
5. The multilevel important email detection method according to claim 1, characterized in that in step (5) chi-square statistics are used for feature selection, the C4.5 algorithm is then used to pre-filter the emails, and the SVM algorithm is then used to build a secondary feature extraction model on the email body, yielding the secondary feature based on the email body.
6. The multilevel important email detection method according to claim 1, characterized in that in step (6) the secondary features obtained in steps (3), (4) and (5) are used to train a neural network model that classifies the emails.
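The final classifier of step (6) takes the three secondary features as input and has one hidden layer of two nodes and a single output node. The following is a minimal sketch of such a network in plain Python. The sigmoid activations, the cross-entropy gradient, the learning rate, the number of epochs, the random initialisation and the class name `TinyNet` are all assumptions made for illustration; the claims fix only the layer sizes.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """3 inputs (f_{i,1}, f_{i,2}, f_{i,3}) -> 2 hidden nodes -> 1 output node."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Each hidden node has 3 input weights followed by a bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
        # The output node has 2 hidden weights followed by a bias.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(3)]

    def forward(self, v):
        h = [sigmoid(sum(w * x for w, x in zip(wj[:3], v)) + wj[3]) for wj in self.w_hidden]
        y = sigmoid(sum(w * hj for w, hj in zip(self.w_out[:2], h)) + self.w_out[2])
        return h, y

    def predict(self, v):
        """Output value in (0, 1); larger means more likely important."""
        return self.forward(v)[1]

    def train(self, vectors, labels, epochs=2000, lr=0.5):
        """Plain stochastic gradient descent on the cross-entropy loss."""
        for _ in range(epochs):
            for v, t in zip(vectors, labels):
                h, y = self.forward(v)
                delta_out = y - t  # gradient at the output pre-activation
                for j in range(2):
                    # Back-propagate through hidden node j before updating w_out[j].
                    delta_h = delta_out * self.w_out[j] * h[j] * (1.0 - h[j])
                    for i in range(3):
                        self.w_hidden[j][i] -= lr * delta_h * v[i]
                    self.w_hidden[j][3] -= lr * delta_h
                    self.w_out[j] -= lr * delta_out * h[j]
                self.w_out[2] -= lr * delta_out


def log_loss(net, vectors, labels):
    """Mean cross-entropy of the network on a labelled set."""
    eps = 1e-12
    total = 0.0
    for v, t in zip(vectors, labels):
        y = net.predict(v)
        total -= t * math.log(y + eps) + (1 - t) * math.log(1.0 - y + eps)
    return total / len(vectors)
```

In use, each email's vector V_i would be passed to `predict`, and the email would be called important when the output exceeds a cut-off; the claims say only that the decision depends on the size of the output value, so the cut-off itself is left to the implementer.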
CN201510752497.7A 2015-11-09 2015-11-09 Multilevel important email detection method Active CN105447505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510752497.7A CN105447505B (en) 2015-11-09 2015-11-09 Multilevel important email detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510752497.7A CN105447505B (en) 2015-11-09 2015-11-09 Multilevel important email detection method

Publications (2)

Publication Number Publication Date
CN105447505A true CN105447505A (en) 2016-03-30
CN105447505B CN105447505B (en) 2018-12-18

Family

ID=55557664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510752497.7A Active CN105447505B (en) 2015-11-09 2015-11-09 Multilevel important email detection method

Country Status (1)

Country Link
CN (1) CN105447505B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955951A (en) * 2016-04-29 2016-09-21 中山大学 Message filtering method and device
CN106357508A (en) * 2016-08-31 2017-01-25 成都启力慧源科技有限公司 Email classification method based on user behavior relationships
CN106372237A (en) * 2016-09-13 2017-02-01 新浪(上海)企业管理有限公司 Fraudulent mail identification method and device
CN106453033A (en) * 2016-08-31 2017-02-22 电子科技大学 Multilevel Email classification method based on Email content
CN107391565A (en) * 2017-06-13 2017-11-24 东南大学 A kind of across language hierarchy taxonomic hierarchies matching process based on topic model
CN107528763A (en) * 2016-06-22 2017-12-29 北京易讯通信息技术股份有限公司 A kind of Mail Contents analysis method based on Spark and YARN
CN109543050A (en) * 2018-11-29 2019-03-29 北京航空航天大学 A kind of mail importance evaluation method of dialogue-based network
CN109635254A (en) * 2018-12-03 2019-04-16 重庆大学 Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model
CN109800433A (en) * 2019-01-24 2019-05-24 深圳市小满科技有限公司 Method, apparatus of filing, electronic equipment and medium based on two disaggregated model of mail
CN109800852A (en) * 2018-11-29 2019-05-24 电子科技大学 A kind of multi-modal spam filtering method
CN109902236A (en) * 2019-03-07 2019-06-18 成都数之联科技有限公司 A kind of spam page down method based on non-probability model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6778941B1 (en) * 2000-11-14 2004-08-17 Qualia Computing, Inc. Message and user attributes in a message filtering method and system
CN1790405A (en) * 2005-12-31 2006-06-21 钱德沛 Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email
CN101227435A (en) * 2008-01-28 2008-07-23 浙江大学 Method for filtering Chinese junk mail based on Logistic regression
CN101345720A (en) * 2008-08-15 2009-01-14 浙江大学 Junk mail classification method based on partial match estimation
CN102024045A (en) * 2010-12-14 2011-04-20 成都市华为赛门铁克科技有限公司 Information classification processing method, device and terminal

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955951B (en) * 2016-04-29 2018-12-11 中山大学 A kind of method and device of message screening
CN105955951A (en) * 2016-04-29 2016-09-21 中山大学 Message filtering method and device
CN107528763A (en) * 2016-06-22 2017-12-29 北京易讯通信息技术股份有限公司 A kind of Mail Contents analysis method based on Spark and YARN
CN106357508A (en) * 2016-08-31 2017-01-25 成都启力慧源科技有限公司 Email classification method based on user behavior relationships
CN106453033A (en) * 2016-08-31 2017-02-22 电子科技大学 Multilevel Email classification method based on Email content
CN106453033B (en) * 2016-08-31 2019-03-15 电子科技大学 Multi-level process for sorting mailings based on Mail Contents
CN106372237A (en) * 2016-09-13 2017-02-01 新浪(上海)企业管理有限公司 Fraudulent mail identification method and device
CN107391565B (en) * 2017-06-13 2020-11-03 东南大学 Matching method of cross-language hierarchical classification system based on topic model
CN107391565A (en) * 2017-06-13 2017-11-24 东南大学 A kind of across language hierarchy taxonomic hierarchies matching process based on topic model
CN109543050A (en) * 2018-11-29 2019-03-29 北京航空航天大学 A kind of mail importance evaluation method of dialogue-based network
CN109800852A (en) * 2018-11-29 2019-05-24 电子科技大学 A kind of multi-modal spam filtering method
CN109543050B (en) * 2018-11-29 2021-08-27 北京航空航天大学 Mail importance evaluation method based on session network
CN109635254A (en) * 2018-12-03 2019-04-16 重庆大学 Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model
CN109800433A (en) * 2019-01-24 2019-05-24 深圳市小满科技有限公司 Method, apparatus of filing, electronic equipment and medium based on two disaggregated model of mail
CN109800433B (en) * 2019-01-24 2023-11-10 深圳市小满科技有限公司 Filing method and device based on mail two-class model, electronic equipment and medium
CN109902236A (en) * 2019-03-07 2019-06-18 成都数之联科技有限公司 A kind of spam page down method based on non-probability model
CN109902236B (en) * 2019-03-07 2021-06-11 成都数之联科技有限公司 Junk web page degradation method based on non-probability model

Also Published As

Publication number Publication date
CN105447505B (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN105447505A (en) Multilevel important email detection method
CN112084335B (en) Social media user account classification method based on information fusion
CN103218444B (en) Based on semantic method of Tibetan language webpage text classification
CN103500175B (en) A kind of method based on sentiment analysis on-line checking microblog hot event
US9967321B2 (en) Meme discovery system
CN103795612A (en) Method for detecting junk and illegal messages in instant messaging
Song et al. Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection
CN105740382A (en) Aspect classification method for short comment texts
Hashida et al. Classifying sightseeing tweets using convolutional neural networks with multi-channel distributed representation
CN108241867B (en) Classification method and device
Tran et al. Spam detection in online classified advertisements
CN105183715A (en) Word distribution and document feature based automatic classification method for spam comments
Zandian et al. Feature extraction method based on social network analysis
CN105117466A (en) Internet information screening system and method
Mukherjee et al. Opinion spam detection: An unsupervised approach using generative models
CN105068986A (en) Method for filtering comment spam based on bidirectional iteration and automatically constructed and updated corpus
CN108268461A (en) A kind of document sorting apparatus based on hybrid classifer
Mirza et al. Evaluating efficiency of classifier for email spam detector using hybrid feature selection approaches
Al Mansoori et al. Suspicious Activity Detection of Twitter and Facebook using Sentimental Analysis.
Salehi et al. Hybrid simple artificial immune system (SAIS) and particle swarm optimization (PSO) for spam detection
Karimi Zandian et al. MEFUASN: a helpful method to extract features using analyzing social network for fraud detection
CN102799666B (en) Method for automatically categorizing texts of network news based on frequent term set
CN106874944A (en) A kind of measure of the classification results confidence level based on Bagging and outlier
Jain et al. A hybrid approach for spam filtering using local concentration based K-means clustering
Wang et al. An Opinion Spam Detection Method Based on Multi-Filters Convolutional Neural Network.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: The inventor has waived the right to be mentioned

Inventor before: The inventor has waived the right to be mentioned

GR01 Patent grant
CP03 Change of name, title or address

Address after: 610041 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan

Patentee after: Chengdu shuzhilian Technology Co.,Ltd.

Address before: No.2, floor 4, building 1, Jule road crossing, Section 1, West 1st ring road, Wuhou District, Chengdu City, Sichuan Province 610041

Patentee before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd.