CN105447505B - A kind of multi-level important email detection method - Google Patents
A kind of multi-level important email detection method
- Publication number
- CN105447505B CN105447505B CN201510752497.7A CN201510752497A CN105447505B CN 105447505 B CN105447505 B CN 105447505B CN 201510752497 A CN201510752497 A CN 201510752497A CN 105447505 B CN105447505 B CN 105447505B
- Authority
- CN
- China
- Prior art keywords
- insignificant
- address
- important
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-level important email detection method built on the email address, subject, and body of each message. The method first extracts an address-based secondary feature using a Bayesian method; it then extracts a subject-based secondary feature using the LDA (latent Dirichlet allocation) and SVM (support vector machine) algorithms; next, it extracts a body-based secondary feature using the C4.5 and SVM algorithms. Finally, the three secondary features (address, subject, and body) are used to train a neural network model. Detecting email importance with this model gives high accuracy and recall.
Description
Technical field
The invention belongs to the technical field of email detection and, more specifically, relates to a multi-level important email detection method suitable for applications such as important email detection and spam filtering.
Background technique
With the rapid development of Internet technology, communication over the Internet has become increasingly frequent, and email has become an indispensable part of daily life, work, and study. However, as email has become an indispensable communication medium, it has also become a commercial tool, so users must spend a great deal of time picking out the important emails they need from the large volume they receive. Some email detection algorithms already address this problem, but they are relatively simple, so their results are not accurate enough; when important emails make up only a small share of all emails, they struggle even more to meet application requirements. Improving the accuracy of important email detection, especially when important emails make up a small share, is therefore an active research topic.
Existing solutions include probability-based methods, statistical-learning methods, and similarity-based clustering methods. Probability-based methods, such as the classical Bayesian method, compute the conditional probability of each class given a set of attribute values and take the class with the largest conditional probability as the result; their drawback is that the underlying preconditions generally do not hold. Statistical-learning methods include SVM and decision trees. SVM is currently one of the better email classification methods: it maps email attributes into a high-dimensional space through a kernel function, builds a maximum-margin hyperplane in that space, and decides the class of an email by which side of the hyperplane it falls on. Its drawback is that kernel selection is somewhat blind and lacks effective guidance, so it is hard to choose the best kernel for a particular problem. A decision tree is a more efficient method: attribute values are first discretized and the tree is then built on the discretized values, splitting level by level until a branch meets a preset requirement or contains a single email. Its drawback is that it overfits easily. Similarity-based clustering methods, such as KNN, compute the distances between emails and assign an email to the class of the samples nearest to it. Their drawback is that the distances between emails must be computed, so classification is inefficient.
Each of these methods has its own strengths and weaknesses. When high accuracy is required and the ratio of important to unimportant emails is heavily skewed, none of them meets the requirements of practical applications.
Summary of the invention
To address the shortcomings and defects of the prior art, the invention provides a multi-level email detection method built on the email address, subject, and body of each message. For each of these information sources, the method builds a secondary-feature extraction model and uses it to obtain a secondary feature; the resulting secondary features are then used as the inputs for training a neural network model. The invention combines Bayesian methods, LDA (latent Dirichlet allocation), SVM (support vector machine), and decision trees, and achieves good results in detecting important emails.
The specific steps of the invention are as follows:
(1) Email preprocessing
From the collected emails, randomly select important and unimportant emails in a given proportion, N emails in total, and label each as "important email" or "unimportant email" according to its actual importance.
(2) For each selected email, extract three pieces of information, the email addresses, the subject, and the body, using regular-expression matching or string matching.
(3) Extracting the secondary feature based on email address pairs
(3.1) Let $A_i$ denote the set of sender and recipient addresses of the i-th email; the set of all addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_N$. Let $freq^+(a_h, a_l)$ denote the number of times the address pair $(a_h, a_l)$ occurs among the address pairs of emails labeled important, and $freq^-(a_h, a_l)$ the number of times it occurs among the address pairs of emails labeled unimportant, where $a_h, a_l \in A$ and $a_h$, $a_l$ come from the same email. The proportions $p^+(a_h, a_l)$ and $p^-(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs in the important-email address-pair set and in the unimportant-email address-pair set, respectively, are obtained from the following formula:
[formula not preserved in the source text]
(3.2) For the i-th email, two subsets of its address pairs are used: the pairs that also belong to the important-email address-pair set, and the pairs that also belong to the unimportant-email address-pair set (the symbols and set expressions for these subsets are not preserved in the source text). The address-pair-based secondary feature $f_{i,1}$ of the i-th email is then computed as:
[formula not preserved in the source text]
where the two quantities in the formula are the number of address pairs of the i-th email contained in the important-email address-pair set and the number contained in the unimportant-email address-pair set.
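The counting in step (3.1) can be sketched as follows. This is a minimal illustration and not the patent's formula, which is not preserved in this text: treating address pairs as unordered and normalising each count by the total count in its class are both assumptions, and the function names and sample addresses are hypothetical.

```python
from collections import Counter
from itertools import combinations

def address_pairs(addresses):
    """Return the address pairs formed within one email (assumed unordered)."""
    return {tuple(sorted(p)) for p in combinations(set(addresses), 2)}

def pair_frequencies(emails):
    """Count freq+ and freq- over labeled emails.

    emails: list of (addresses, label); label 1 = important, 0 = unimportant.
    """
    freq_pos, freq_neg = Counter(), Counter()
    for addresses, label in emails:
        target = freq_pos if label == 1 else freq_neg
        target.update(address_pairs(addresses))
    return freq_pos, freq_neg

def proportions(freq):
    """Assumed normalisation: share of each pair among all counted pairs."""
    total = sum(freq.values())
    return {pair: n / total for pair, n in freq.items()} if total else {}

# Hypothetical training emails: (sender and recipient addresses, label)
emails = [
    (["alice@uni.edu", "bob@uni.edu"], 1),
    (["alice@uni.edu", "bob@uni.edu", "carol@corp.com"], 1),
    (["promo@ads.com", "bob@uni.edu"], 0),
]
freq_pos, freq_neg = pair_frequencies(emails)
p_pos = proportions(freq_pos)
p_neg = proportions(freq_neg)
```

Here the pair (alice@uni.edu, bob@uni.edu) is counted twice among important emails, so under the assumed normalisation it accounts for half of the four important-email pair occurrences.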
(4) Extracting the secondary feature based on the email subject
(4.1) Segment the email subjects with a Chinese word segmentation system, and select the nouns, verbs, adjectives, and adverbs from the segmentation result as feature words, obtaining F feature words.
(4.2) Using the F feature words from step (4.1), count the frequency of each feature word in the i-th email and vectorize the counts, giving N F-dimensional vectors $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,F})$, $1 \le i \le N$, which together form the training matrix $(TM)_{N \times F}$. First, use $(TM)_{N \times F}$ as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the latent topic information of the emails; the topic model outputs N T-dimensional vectors $X'_i = (x'_{i,1}, x'_{i,2}, \dots, x'_{i,T})$, which form the output matrix $(TM\_SVM)_{N \times T}$, where T is the preset number of topics. Then, with $(TM\_SVM)_{N \times T}$ as input and the email labels as targets, train a subject-based classification model using the SVM (support vector machine) algorithm. The output of this model gives the probability that the i-th email is important; this probability is taken as the email's subject-based secondary feature, denoted $f_{i,2}$.
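The first part of step (4.2), building the term-frequency matrix, might look like the sketch below. It assumes the subjects have already been segmented into tokens; the function name and the English sample words are hypothetical. The LDA and SVM stages are omitted and would come from whichever machine-learning library is used.

```python
def term_frequency_matrix(tokenized_subjects, feature_words):
    """Build the N x F matrix of feature-word counts, one row per email."""
    index = {w: j for j, w in enumerate(feature_words)}
    matrix = []
    for tokens in tokenized_subjects:
        row = [0] * len(feature_words)
        for tok in tokens:
            j = index.get(tok)
            if j is not None:  # ignore tokens that are not feature words
                row[j] += 1
        matrix.append(row)
    return matrix

# Hypothetical feature words and already-segmented subjects
feature_words = ["meeting", "deadline", "discount"]
subjects = [["meeting", "deadline", "meeting"], ["discount", "sale"]]
TM = term_frequency_matrix(subjects, feature_words)
```

Each row of `TM` plays the role of one vector $X_i$; the matrix as a whole is what the text feeds to the topic model.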
(5) Extracting the secondary feature based on the email body
(5.1) Body preprocessing
Segment the email bodies with a Chinese word segmentation system. Select nouns and verbs from the segmentation result by part of speech as candidate feature words, giving the candidate feature word set of the training emails; then compute the chi-square value of each candidate feature word according to the following formula:
$$\chi^2(t, c) = \frac{N (AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}$$
where t is a candidate feature word, c is a class (here only important or unimportant), A is the number of class-c emails in which t occurs, B is the number of non-class-c emails in which t occurs, C is the number of class-c emails in which t does not occur, D is the number of non-class-c emails in which t does not occur, and N is the size of the training set. The G candidate feature words with the largest chi-square values are kept as the feature words for subsequent processing; this filters out words that contribute little to classification and reduces computational complexity.
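The selection in step (5.1) can be sketched as follows, assuming the standard chi-square statistic for feature selection, N(AD - BC)^2 / ((A + C)(B + D)(A + B)(C + D)), with A, B, C, D as defined in the text. The function names and the sample counts are hypothetical.

```python
def chi_square(A, B, C, D):
    """Chi-square value of a word from its four document counts.

    A: class-c emails containing the word, B: other emails containing it,
    C: class-c emails without it, D: other emails without it.
    """
    N = A + B + C + D  # the four cells together cover the whole training set
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def top_features(counts, G):
    """Return the G words with the largest chi-square values."""
    scored = {t: chi_square(*abcd) for t, abcd in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:G]

# Hypothetical counts (A, B, C, D) over 100 emails, 50 per class
counts = {
    "conference": (40, 5, 10, 45),
    "the": (25, 25, 25, 25),
    "discount": (2, 30, 48, 20),
}
selected = top_features(counts, G=2)
```

A word that is spread evenly across both classes, such as "the" in this toy data, scores zero and is filtered out, while words concentrated in either class are kept.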
(5.2) Preliminary email classification
Using the G feature words obtained in (5.1), compute the tf-idf values of the feature words for the i-th email and vectorize them, giving a new vector $Y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,G})$, $1 \le i \le N$. Use the vectors $Y_i$ as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than a threshold α, that node is judged to be an unimportant-email node; controlling α in each leaf node keeps the overall recall of important emails at a high level. This trains a preliminary classification model that filters out part of the unimportant emails, and the model divides emails into two classes, important and unimportant.
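The leaf-labelling rule of step (5.2) can be illustrated in isolation. This sketch assumes the decision tree has already been grown and only shows how the threshold α turns leaf statistics into labels; the leaf names and counts are hypothetical, and α = 0.03 is the value the embodiment reports.

```python
def label_leaf(n_important, n_total, alpha):
    """Return 0 (unimportant node) if the important share is below alpha, else 1."""
    return 0 if n_important / n_total < alpha else 1

# Hypothetical leaves: (important emails in the leaf, all emails in the leaf)
leaves = {"leaf_a": (1, 100), "leaf_b": (5, 100), "leaf_c": (60, 80)}
labels = {name: label_leaf(k, n, alpha=0.03) for name, (k, n) in leaves.items()}
```

Because α is small, a leaf is marked unimportant only when it contains almost no important emails, which is how the rule trades filtering power for high recall of important emails.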
(5.3) Secondary feature extraction
The preliminary classification model from step (5.2) divides emails into important and unimportant. For an email judged important, compute for each of the G feature words from step (5.1) the Bayesian probabilities that it belongs to important emails and to unimportant emails, and use the ratio of the former to the latter as the feature value of that word; then vectorize. For an email judged unimportant, set the feature values of all G feature words to 0 and vectorize in the same way. This yields a new vector $Z_i = (z_{i,1}, z_{i,2}, \dots, z_{i,G})$, $1 \le i \le N$. Use $Z_i$ as the input of the SVM algorithm, with the true class labels as targets, to build a body-based classification model. The output of this model gives the probability that the i-th email is important; this probability is taken as the email's body-based secondary feature, denoted $f_{i,3}$.
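One possible reading of the feature values in step (5.3) is sketched below. The text does not say how the Bayesian probabilities are estimated, so several choices here are assumptions: probabilities are estimated from document counts with add-one smoothing, and a feature word absent from an email gets the value 0. All names and sample data are hypothetical.

```python
def bayes_ratio(word, docs_important, docs_unimportant):
    """Ratio P(important | word) / P(unimportant | word), add-one smoothed."""
    n_imp = sum(word in d for d in docs_important)
    n_unimp = sum(word in d for d in docs_unimportant)
    # P(word | class), smoothed so that neither probability is zero
    p_w_imp = (n_imp + 1) / (len(docs_important) + 2)
    p_w_unimp = (n_unimp + 1) / (len(docs_unimportant) + 2)
    prior_imp = len(docs_important) / (len(docs_important) + len(docs_unimportant))
    # Bayes' rule; the shared evidence term P(word) cancels in the ratio
    return (p_w_imp * prior_imp) / (p_w_unimp * (1 - prior_imp))

def body_vector(tokens, judged_important, feature_words, ratios):
    """Vector Z_i: ratios for an email judged important, all zeros otherwise."""
    if not judged_important:
        return [0.0] * len(feature_words)
    present = set(tokens)
    return [ratios[w] if w in present else 0.0 for w in feature_words]

# Hypothetical training bodies, each given as a set of words
docs_important = [{"conference", "paper"}, {"conference"}]
docs_unimportant = [{"discount"}, {"discount", "paper"}]
feature_words = ["conference", "discount"]
ratios = {w: bayes_ratio(w, docs_important, docs_unimportant) for w in feature_words}

z_kept = body_vector(["conference", "paper"], True, feature_words, ratios)
z_filtered = body_vector(["discount"], False, feature_words, ratios)
```

A ratio above 1 marks a word that is more typical of important emails, and a ratio below 1 marks one more typical of unimportant emails.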
(6) Modeling with the secondary features
Combine the secondary features $f_{i,1}, f_{i,2}, f_{i,3}$ obtained in steps (3), (4), and (5) into a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$ and use it as the input of a neural network. Train a neural network with a single hidden layer of two nodes and an output layer of one node; its output lies in the interval [0, 1]. If the output is greater than a threshold θ, the email is an important email; otherwise it is an unimportant email.
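The shape of the network in step (6) can be sketched as a forward pass. Training is omitted, the weights are hypothetical values chosen by hand, and the sigmoid activation is an assumption (the text only states that the output lies in [0, 1]). The threshold 0.53 is the value the embodiment reports for θ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(v, W1, b1, W2, b2):
    """Forward pass: 3 inputs -> 2 hidden nodes -> 1 output in (0, 1)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, v)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)

def classify(v, params, theta):
    """Return 1 (important) if the network output exceeds theta, else 0."""
    return 1 if forward(v, *params) > theta else 0

# Hypothetical, hand-picked weights: (W1, b1, W2, b2)
params = ([[2.0, 1.5, 1.5], [-1.0, -2.0, -2.0]], [-2.0, 2.0], [3.0, -3.0], 0.0)

high = classify([0.9, 0.9, 0.9], params, theta=0.53)  # V_i with high features
low = classify([0.0, 0.0, 0.0], params, theta=0.53)   # V_i with low features
```

With only three inputs and two hidden nodes, the network acts as a small learned combiner of the three secondary features rather than a classifier over raw email content.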
(7) Important email detection
To classify a new email, apply steps (3), (4), and (5) to obtain its address-based, subject-based, and body-based secondary features, then use the neural network classification model built in step (6) to classify the email.
Brief description of the drawings
Fig. 1 is a flowchart of the multi-level email detection method of the invention, built on the email address, subject, and body.
Detailed description of the embodiments
Specific embodiments of the invention are described below with reference to the accompanying drawing so that those skilled in the art can better understand the invention. Note in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.
The emails used in this embodiment include notification emails for international academic conferences, international trade correspondence, inter-company correspondence, advertising emails, and phishing emails circulating on the network. In keeping with actual practice, the first three types are treated as important emails and the last two as unimportant emails.
In this embodiment, the email classification method comprises the following steps:
(1) Email preprocessing
From the collected emails, randomly select $N_1$ = 700 important emails and $N_2$ = 7000 unimportant emails, N = 7700 emails in total, as the training text, and label each of the N selected emails as "important email" or "unimportant email" (here, important emails are labeled 1 and unimportant emails 0). This step corresponds to step ST1 in Fig. 1.
(2) For each email, extract three pieces of information, the email addresses, the subject, and the body, by regular-expression matching or string matching. The matching expression for email addresses is:
Reg_add = "/^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\w{2,3})+$/";
The subject is extracted according to the subject tag and content tag that appear in the email; the body is what remains after the matched address information and subject information are removed. This step is step ST2 in Fig. 1.
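Step (2) can be sketched with Python's `re` module. This is an illustration under assumptions, not the embodiment's code: the input is assumed to be a raw message with RFC 822 style `From:`, `To:`, and `Subject:` header lines separated from the body by a blank line, the address pattern is an unanchored variant that requires a literal dot before the final label, and `split_email` is a hypothetical helper.

```python
import re

# Unanchored address pattern (assumed variant with a literal dot before the last label)
ADDRESS_RE = re.compile(r"\w+(?:[.-]?\w+)*@\w+(?:[.-]?\w+)*(?:\.\w{2,3})+")
SUBJECT_RE = re.compile(r"^Subject:\s*(.*)$", re.MULTILINE)

def split_email(raw):
    """Split a raw message into (addresses, subject, body)."""
    headers, _, body = raw.partition("\n\n")
    addresses = sorted(set(ADDRESS_RE.findall(headers)))
    match = SUBJECT_RE.search(headers)
    subject = match.group(1).strip() if match else ""
    return addresses, subject, body.strip()

# Hypothetical raw message
raw = (
    "From: alice@uni.edu\n"
    "To: bob@corp.com, carol@corp.com\n"
    "Subject: Call for papers\n"
    "\n"
    "Submissions are due next month.\n"
)
addresses, subject, body = split_email(raw)
```

Searching only the header block for addresses keeps addresses quoted in the body from being mistaken for senders or recipients.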
(3) Extracting the secondary feature based on email address pairs
(3.1) Let $A_i$ denote the set of sender and recipient addresses of the i-th email; the set of all addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_{7700}$. Compute $freq^+(a_h, a_l)$ and $freq^-(a_h, a_l)$, where $a_h, a_l \in A$ and $a_h$, $a_l$ come from the same email. The proportions $p^+(a_h, a_l)$ and $p^-(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs in the important-email address-pair set and in the unimportant-email address-pair set, respectively, are obtained from the following formula:
[formula not preserved in the source text]
(3.2) For the i-th email, $1 \le i \le 7700$, two subsets of its address pairs are used: the pairs that also belong to the important-email address-pair set, and the pairs that also belong to the unimportant-email address-pair set (the symbols and set expressions for these subsets are not preserved in the source text). The address-based secondary feature of the i-th email is computed with the following formula:
[formula not preserved in the source text]
where the two quantities in the formula are the number of address pairs of the i-th email contained in the important-email address-pair set and the number contained in the unimportant-email address-pair set. This step is step ST4 in Fig. 1.
(4) Extracting the secondary feature based on the email subject
(4.1) Segment the subject of every email with a Chinese word segmentation system and select the nouns, verbs, adjectives, and adverbs from the segmentation result as feature words, obtaining F = 205 feature words. This step is step ST5 in Fig. 1.
(4.2) Using the F feature words from step (4.1), count the frequency of each feature word in every email and vectorize the counts, giving N F-dimensional vectors $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,205})$, $1 \le i \le 7700$, which form the training matrix $(TM)_{7700 \times 205}$. Use $(TM)_{7700 \times 205}$ as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model with T = 12 topics, which identifies the latent topic information of the emails; the topic model outputs N 12-dimensional vectors $X'_i = (x'_{i,1}, x'_{i,2}, \dots, x'_{i,12})$, which form the output matrix $(TM\_SVM)_{7700 \times 12}$. Then, with $(TM\_SVM)_{7700 \times 12}$ as input and a Gaussian kernel function, train the subject-based secondary-feature extraction model using the SVM (support vector machine) algorithm. This model extracts the subject-based secondary feature (the probability that the email is important), denoted $f_{i,2}$. This step is step ST6 in Fig. 1.
(5) Extracting the secondary feature based on the email body
(5.1) Body preprocessing
Segment the email bodies with a Chinese word segmentation system, and select nouns and verbs from the segmentation result as candidate feature words, giving the candidate feature word set of all training emails; then compute the chi-square value of each candidate feature word according to the following formula:
$$\chi^2(t, c) = \frac{N (AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}$$
where t is a candidate feature word, c is a class (here only important or unimportant), A is the number of class-c emails in which t occurs, B is the number of non-class-c emails in which t occurs, C is the number of class-c emails in which t does not occur, D is the number of non-class-c emails in which t does not occur, and N = 7700 is the number of emails in the whole training set. From all candidate feature words, keep the G = 230 with the largest chi-square values as feature words; this filters out words that contribute little to classification and reduces computational complexity. This step is step ST7 in Fig. 1.
(5.2) Email filtering
Using the G = 230 feature words obtained in (5.1), compute the tf-idf values of the feature words for every email and vectorize them, giving a new vector $Y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,230})$, $1 \le i \le 7700$. Use the vectors $Y_i$ as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than the threshold α, that node is judged to be an unimportant-email node; setting α = 0.03 in each leaf node keeps the overall recall of important emails at a high level. The resulting preliminary classification model divides emails into two classes, important and unimportant. This step is step ST8 in Fig. 1.
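The tf-idf vectorization in step (5.2) can be sketched as below. The text does not say which tf-idf variant is used, so this sketch assumes term frequency normalised by email length and idf = log(N / df); the function name and sample tokens are hypothetical.

```python
import math

def tfidf_vectors(docs, feature_words):
    """Return one tf-idf vector per tokenized email over the feature words."""
    n_docs = len(docs)
    # document frequency: number of emails that contain each feature word
    df = {w: sum(w in set(d) for d in docs) for w in feature_words}
    vectors = []
    for tokens in docs:
        row = []
        for w in feature_words:
            tf = tokens.count(w) / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        vectors.append(row)
    return vectors

# Hypothetical segmented bodies
docs = [["conference", "paper", "conference"], ["discount", "paper"]]
Y = tfidf_vectors(docs, ["conference", "paper"])
```

In this toy data "paper" occurs in every email, so its idf is zero and it carries no weight, while "conference" keeps a positive weight in the email that contains it.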
(5.3) Secondary feature extraction
The preliminary classification model from step (5.2) divides emails into important and unimportant. For an email judged important, compute for each of the G feature words from step (5.1) the Bayesian probabilities that it belongs to important emails and to unimportant emails, and use the ratio of the former to the latter as the feature value of that word; then vectorize. For an email judged unimportant, set the feature values of all G feature words to 0 and vectorize in the same way. This yields a new vector $Z_i = (z_{i,1}, z_{i,2}, \dots, z_{i,230})$, $1 \le i \le 7700$. Use $Z_i$ as the input of the SVM algorithm, with the true class labels as targets, to build a body-based classification model. The output of this model gives the probability that the i-th email is important; this probability is taken as the email's body-based secondary feature, denoted $f_{i,3}$. This step is step ST9 in Fig. 1.
(6) Modeling with the secondary features
Combine the features $f_{i,1}, f_{i,2}, f_{i,3}$ obtained in steps (3), (4), and (5) into a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$, $1 \le i \le 7700$, and use it as the input of a neural network. Train a neural network with a single hidden layer of two nodes and an output layer of one node; its output lies in the interval [0, 1]. When the output is greater than θ, the email is judged important; otherwise it is judged unimportant. Training analysis gave θ = 0.53. This step is step ST10 in Fig. 1.
(7) Email prediction
To verify the effect of the multi-level important email detection method, cross-validation is used: from the 7700 processed emails, 80% are randomly selected as the training set and the remaining 20% as the validation set. This is repeated 100 times. The average accuracy is 86.2% and the average recall is 90.3%.
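The evaluation protocol described above (100 random 80/20 splits, averaging accuracy and recall) can be sketched as follows. The classifier here is a stand-in threshold rule rather than the multi-level model, and all names and sample values are hypothetical.

```python
import random

def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all emails and recall over important emails (label 1)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    # recall is reported as 0.0 when a split contains no important email
    return correct / len(y_true), (tp / positives if positives else 0.0)

def repeated_holdout(samples, labels, fit, repeats=100, train_share=0.8, seed=0):
    """Average accuracy and recall over repeated random train/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    acc_sum = rec_sum = 0.0
    for _ in range(repeats):
        rng.shuffle(idx)
        cut = int(train_share * len(idx))
        train, valid = idx[:cut], idx[cut:]
        predict = fit([samples[i] for i in train], [labels[i] for i in train])
        y_pred = [predict(samples[i]) for i in valid]
        acc, rec = accuracy_and_recall([labels[i] for i in valid], y_pred)
        acc_sum += acc
        rec_sum += rec
    return acc_sum / repeats, rec_sum / repeats

def fit_threshold(train_x, train_y):
    """Stand-in for model training: returns a fixed threshold rule."""
    return lambda x: 1 if x > 0.53 else 0

# Hypothetical scores and labels
samples = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7, 0.05, 0.95, 0.4, 0.6]
labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]
avg_acc, avg_rec = repeated_holdout(samples, labels, fit_threshold)
```

Fixing the random seed makes the 100 splits reproducible, which matters when averages from different models are compared.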
These results show that, when the ratio of important to unimportant emails is imbalanced, the method improves on other email classification models by 15%-20%. This indicates that the invention has good application value in fields such as important email identification.
Although illustrative specific embodiments of the invention are described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, variations are obvious so long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.
Claims (1)
1. A multi-level important email detection method, characterized by comprising the following steps:
(1) Email preprocessing
From the collected emails, randomly select N emails and label each as "important email" or "unimportant email" according to its actual importance;
(2) For each selected email, extract three pieces of information, the email addresses, the subject, and the body, by regular-expression matching or string matching;
(3) Extracting the secondary feature based on email addresses
(3.1) Let $A_i$ denote the set of sender and recipient addresses of the i-th email; the set of all addresses of the N emails can then be expressed as $A = A_1 \cup A_2 \cup \dots \cup A_N$. Let $freq^+(a_h, a_l)$ denote the total number of times the address pair $(a_h, a_l)$ occurs in the set of address pairs formed in important emails, and $freq^-(a_h, a_l)$ the total number of times it occurs in the set of address pairs formed in unimportant emails, where $a_h, a_l \in A$ and $a_h$, $a_l$ come from the same email; the proportions $p^+(a_h, a_l)$ and $p^-(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs among important-email address pairs and among unimportant-email address pairs, respectively, are taken as the secondary features based on email addresses, wherein
[formula not preserved in the source text]
(3.2) For the i-th email, two subsets of its address pairs are used: the pairs that also belong to the important-email address-pair set, and the pairs that also belong to the unimportant-email address-pair set (the symbols and set expressions for these subsets are not preserved in the source text); the address-pair-based secondary feature $f_{i,1}$ of the i-th email is then computed as:
[formula not preserved in the source text]
where the two quantities in the formula are the number of address pairs of the i-th email contained in the important-email address-pair set and the number contained in the unimportant-email address-pair set;
(4) Extracting the secondary feature based on the email subject
(4.1) Segment the email subjects with a Chinese word segmentation system, and select the nouns, verbs, adjectives, and adverbs from the segmentation result as feature words, obtaining F feature words; count the frequency of these F feature words in every email and vectorize; form the training matrix $(TM)_{N \times F}$ from the resulting N F-dimensional vectors and use it as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the hidden topic information of the emails; use the N T-dimensional vectors output by the trained LDA topic model as the input of the SVM (support vector machine), where T is the number of topics, and, with the email class labels as targets, train the subject-based secondary-feature extraction model using the SVM algorithm; this model extracts the secondary feature of the i-th email, denoted $f_{i,2}$;
(5) Extracting the secondary feature based on the email body
Segment the email bodies with a Chinese word segmentation system, compute the chi-square value of each word, and select the G words with the largest chi-square values as feature words; compute the tf-idf values of these G feature words for every email and vectorize; use the resulting vectors as the input of the C4.5 decision tree algorithm, and judge a leaf node to be an unimportant-email node if the proportion of important emails in it is less than a threshold α, thereby training a filter model that filters out part of the unimportant emails; divide the emails into important and unimportant with the preliminary classification model built by the C4.5 algorithm; for an email judged important, compute the Bayesian probabilities that each of the G feature words belongs to important emails and to unimportant emails, use the ratio of the former to the latter as the feature value of that word, and vectorize; for an email judged unimportant, set the feature values of all G feature words to 0 and vectorize; use the resulting vectors as the input of the SVM algorithm, with the true class labels as targets, to build the body-based secondary-feature extraction model; this model extracts the body-based secondary feature of the i-th email, denoted $f_{i,3}$;
(6) Modeling with the secondary features
Combine the features $f_{i,1}, f_{i,2}, f_{i,3}$ obtained in steps (3), (4), and (5) into a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$, use it as the input of a neural network algorithm, and train a neural network with a single hidden layer of two nodes and an output layer of one node; whether an email is important is judged from the magnitude of the output-layer value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510752497.7A CN105447505B (en) | 2015-11-09 | 2015-11-09 | A kind of multi-level important email detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510752497.7A CN105447505B (en) | 2015-11-09 | 2015-11-09 | A kind of multi-level important email detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105447505A CN105447505A (en) | 2016-03-30 |
CN105447505B true CN105447505B (en) | 2018-12-18 |
Family
ID=55557664
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510752497.7A Active CN105447505B (en) | 2015-11-09 | 2015-11-09 | A kind of multi-level important email detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105447505B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105955951B (en) * | 2016-04-29 | 2018-12-11 | 中山大学 | A kind of method and device of message screening |
CN107528763A (en) * | 2016-06-22 | 2017-12-29 | 北京易讯通信息技术股份有限公司 | A kind of Mail Contents analysis method based on Spark and YARN |
CN106357508A (en) * | 2016-08-31 | 2017-01-25 | 成都启力慧源科技有限公司 | Email classification method based on user behavior relationships |
CN106453033B (en) * | 2016-08-31 | 2019-03-15 | 电子科技大学 | Multi-level process for sorting mailings based on Mail Contents |
CN106372237A (en) * | 2016-09-13 | 2017-02-01 | 新浪(上海)企业管理有限公司 | Fraudulent mail identification method and device |
CN107391565B (en) * | 2017-06-13 | 2020-11-03 | 东南大学 | Matching method of cross-language hierarchical classification system based on topic model |
CN109800852A (en) * | 2018-11-29 | 2019-05-24 | 电子科技大学 | A kind of multi-modal spam filtering method |
CN109543050B (en) * | 2018-11-29 | 2021-08-27 | 北京航空航天大学 | Mail importance evaluation method based on session network |
CN109635254A (en) * | 2018-12-03 | 2019-04-16 | 重庆大学 | Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model |
CN109800433B (en) * | 2019-01-24 | 2023-11-10 | 深圳市小满科技有限公司 | Filing method and device based on mail two-class model, electronic equipment and medium |
CN109902236B (en) * | 2019-03-07 | 2021-06-11 | 成都数之联科技有限公司 | Junk web page degradation method based on non-probability model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6778941B1 (en) * | 2000-11-14 | 2004-08-17 | Qualia Computing, Inc. | Message and user attributes in a message filtering method and system |
CN1790405A (en) * | 2005-12-31 | 2006-06-21 | 钱德沛 | Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email |
CN101227435A (en) * | 2008-01-28 | 2008-07-23 | 浙江大学 | Method for filtering Chinese junk mail based on Logistic regression |
CN101345720A (en) * | 2008-08-15 | 2009-01-14 | 浙江大学 | Junk mail classification method based on partial match estimation |
CN102024045A (en) * | 2010-12-14 | 2011-04-20 | 成都市华为赛门铁克科技有限公司 | Information classification processing method, device and terminal |
Also Published As
Publication number | Publication date |
---|---|
CN105447505A (en) | 2016-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105447505B (en) | A kind of multi-level important email detection method | |
CN103793484B (en) | The fraud identifying system based on machine learning in classification information website | |
CN112084335B (en) | Social media user account classification method based on information fusion | |
CN105005594B (en) | Abnormal microblog users recognition methods | |
CN102789498B (en) | Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning | |
CN106844424A (en) | A kind of file classification method based on LDA | |
CN105912716A (en) | Short text classification method and apparatus | |
CN102129568B (en) | Method for detecting image-based spam email by utilizing improved gauss hybrid model classifier | |
CN105740382A (en) | Aspect classification method for short comment texts | |
CN103500175A (en) | Method for microblog hot event online detection based on emotion analysis | |
CN108241867B (en) | Classification method and device | |
Ruan et al. | GADM: Manual fake review detection for O2O commercial platforms | |
Hashida et al. | Classifying sightseeing tweets using convolutional neural networks with multi-channel distributed representation | |
CN107885849A (en) | A kind of moos index analysis system based on text classification | |
CN104809104A (en) | Method and system for identifying micro-blog textual emotion | |
CN103268346B (en) | Semisupervised classification method and system | |
Mukherjee et al. | Opinion spam detection: An unsupervised approach using generative models | |
CN103744958B (en) | A kind of Web page classification method based on Distributed Calculation | |
Rajesh et al. | Fraudulent news detection using machine learning approaches | |
CN105337842B (en) | A kind of rubbish mail filtering method unrelated with content | |
CN108268461A (en) | A kind of document sorting apparatus based on hybrid classifer | |
CN104572613A (en) | Data processing device, data processing method and program | |
CN108960282A (en) | A kind of online service measures of reputation method based on semi-supervised learning | |
CN105159905B (en) | Microblogging clustering method based on forwarding relationship | |
CN115033689B (en) | Original network Euclidean distance calculation method based on small sample text classification |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
CB03 | Change of inventor or designer information | Inventor after: The inventor has waived the right to be mentioned; Inventor before: The inventor has waived the right to be mentioned |
GR01 | Patent grant | |
CP03 | Change of name, title or address | Address after: 610041 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan; Patentee after: Chengdu shuzhilian Technology Co.,Ltd.; Address before: No.2, floor 4, building 1, Jule road crossing, Section 1, West 1st ring road, Wuhou District, Chengdu City, Sichuan Province 610041; Patentee before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd. |