CN105447505B - A multi-level important email detection method - Google Patents

Info

Publication number
CN105447505B
Authority
CN
China
Prior art keywords
mail
email
insignificant
address
important
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510752497.7A
Other languages
Chinese (zh)
Other versions
CN105447505A (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Shuzhilian Technology Co Ltd
Original Assignee
Chengdu Shuzhilian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Shuzhilian Technology Co Ltd
Priority to CN201510752497.7A
Publication of CN105447505A
Application granted
Publication of CN105447505B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a multi-level important email detection method built on the email addresses, subject, and body of each email. The method first uses a Bayesian approach to extract a secondary feature of the email based on its email addresses. It then extracts a secondary feature based on the email subject using LDA (latent Dirichlet allocation) and an SVM (support vector machine), and a secondary feature based on the email body using C4.5 and an SVM. Finally, the three secondary features (address-based, subject-based, and body-based) are used to train a neural network model. Detecting email importance with this model achieves relatively high accuracy and recall.

Description

A multi-level important email detection method
Technical field
The invention belongs to the technical field of email detection and, more specifically, relates to a multi-level important email detection method suitable for applications such as important email detection and spam filtering.
Background
With the rapid development of Internet technology, communication over the Internet has become increasingly frequent, and email has become an indispensable part of life, work, and study. However, while email has become an indispensable communication medium, it has also become a commercial tool, so users must spend a great deal of time picking out the important emails they need from the large volume they receive. Some email detection algorithms already address this problem, but they are relatively simple, so their results are not accurate enough; when important emails make up only a small proportion of all email, they struggle even more to meet application requirements. Improving the accuracy of important email detection, especially when important emails are a small proportion, is therefore an active research topic.
Existing solutions include probability-based methods, statistical-learning-based methods, and similarity-clustering-based methods. Probability-based methods, such as the classical Bayesian method, compute the conditional probability of each class given a set of attribute values and take the class with the highest conditional probability as the classification result; their drawback is that the preconditions they rely on generally do not hold. Statistical-learning-based methods include SVMs and decision trees. The SVM is currently one of the better email classification methods: it maps email attributes into a high-dimensional space through a kernel function, builds a maximum-margin hyperplane in that space, and determines an email's class from its position relative to that hyperplane. Its drawback is that kernel selection is somewhat blind and lacks effective guidance, so it is hard to choose the best kernel for a particular problem. The decision tree is a relatively efficient method: it first discretizes the attribute values and then grows the tree on the discretized values, level by level, until a branch meets a preset requirement or contains only a single email. Its drawback is that it overfits easily. Similarity-clustering-based methods, such as KNN, compute the distances between emails and assign an email to the class whose samples are nearest. Their drawback is that the distances between emails must be computed, so classification is inefficient.
Each of these methods has its own strengths and weaknesses. When high accuracy is required and the ratio of important to unimportant emails is highly imbalanced, these methods cannot meet the requirements of practical applications.
Summary of the invention
In view of the deficiencies of the prior art, the invention provides a multi-level email detection method built on information such as the email addresses, email subject, and email body. For each of these information sources, the method builds a secondary feature extraction model and uses it to obtain a secondary feature; the resulting secondary features then serve as the input for training a neural network model. The invention combines Bayesian methods, LDA (latent Dirichlet allocation), SVMs (support vector machines), and decision trees, and achieves good results in detecting important emails.
The specific steps of the invention are as follows:
(1) Email preprocessing
From the collected emails, randomly select important and unimportant emails in a certain proportion, N emails in total, and label each one "important email" or "unimportant email" according to its actual importance.
(2) For each selected email, extract three pieces of information, namely the email addresses, the email subject, and the email body, using a regular expression matching algorithm or a string matching algorithm.
(3) Extracting the secondary feature based on email address pairs
(3.1) Denote the set of sender and recipient email addresses of the i-th email as $A_i$; the set of all email addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_N$. Let $\mathrm{freq}^+(a_h, a_l)$ denote the number of times the email address pair $(a_h, a_l)$ occurs among the address pairs of emails labeled important, and let $\mathrm{freq}^-(a_h, a_l)$ denote the number of times it occurs among the address pairs of emails labeled unimportant, where $a_h, a_l \in A$ and $a_h, a_l$ are email addresses from the same email. The proportions $p^+(a_h, a_l)$ and $p^-(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs in the important-email address pair set and in the unimportant-email address pair set, respectively, are obtained from the following formula:
[formula image missing]
(3.2) Let [symbol missing] denote the part of the i-th email's address pair set that is contained in the important-email address pair set, expressed as [expression missing], and let [symbol missing] denote the part that is contained in the unimportant-email address pair set, expressed as [expression missing]. The secondary feature $f_{i,1}$ of the i-th email based on email address pairs can then be computed as:
[formula image missing]
where [symbol missing] is the number of address pairs of the i-th email that are contained in the important-email address pair set, and [symbol missing] is the number that are contained in the unimportant-email address pair set.
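The counting in step (3) can be sketched as follows. This is only an illustration under stated assumptions: address pairs are treated as unordered pairs of distinct addresses from one email, and the ratio is assumed to be the pair's count divided by the total pair count of that class, because the formula itself is not reproduced in this text. The function names are hypothetical, and the final combination into $f_{i,1}$ is deliberately left out since its formula is missing.

```python
from collections import Counter
from itertools import combinations


def address_pairs(addresses):
    """Unordered pairs of distinct addresses taken from one email (assumption)."""
    return list(combinations(sorted(set(addresses)), 2))


def pair_frequencies(emails):
    """emails: list of (addresses, label); label 1 = important, 0 = unimportant.

    Returns (freq_pos, freq_neg): how often each pair occurs in each class.
    """
    freq_pos, freq_neg = Counter(), Counter()
    for addresses, label in emails:
        target = freq_pos if label == 1 else freq_neg
        target.update(address_pairs(addresses))
    return freq_pos, freq_neg


def pair_ratio(freq, pair):
    """ASSUMED normalization: count of this pair / total count of all pairs."""
    total = sum(freq.values())
    return freq[pair] / total if total else 0.0


def pair_hits(addresses, freq_pos, freq_neg):
    """Number of this email's pairs seen in the important / unimportant sets."""
    pairs = address_pairs(addresses)
    in_pos = sum(1 for p in pairs if p in freq_pos)
    in_neg = sum(1 for p in pairs if p in freq_neg)
    return in_pos, in_neg
```

The two counts returned by `pair_hits` correspond to the quantities described at the end of step (3.2); how they are combined into the feature value depends on the missing formula.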
(4) Extracting the secondary feature based on the email subject
(4.1) Segment the email subjects with a Chinese word segmentation system and select the nouns, verbs, adjectives, and adverbs from the segmented words as feature words, giving F feature words.
(4.2) Using the F feature words from step (4.1), count how often each feature word occurs in the i-th email and vectorize the counts, giving N vectors of dimension F, $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,F})$, $1 \le i \le N$. These N vectors form the training matrix $(TM)_{N \times F}$. First, $(TM)_{N \times F}$ is used as the input to the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the latent topic information of the emails. The output of the topic model gives N vectors of dimension T, $X'_i = (x'_{i,1}, x'_{i,2}, \dots, x'_{i,T})$, which form the output matrix $(TM\_SVM)_{N \times T}$, where T is a preset number of topics. Then $(TM\_SVM)_{N \times T}$ is used as the input, with the email labels as the target, to train a subject-based classification model with the SVM (support vector machine) algorithm. The output of this model gives the probability that the i-th email is important, and this probability is taken as the email's subject-based secondary feature, denoted $f_{i,2}$.
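The first part of step (4.2), building the term-frequency matrix $(TM)_{N \times F}$, can be sketched as below. The sketch assumes the subjects are already segmented into token lists and that the feature word list is given; the function name is hypothetical. The LDA and SVM stages are not shown and would normally come from a machine learning library.

```python
def term_frequency_matrix(tokenized_subjects, feature_words):
    """Return an N x F matrix: row i holds the counts of each feature word
    in the i-th subject. Tokens outside the feature word list are ignored."""
    index = {word: j for j, word in enumerate(feature_words)}
    matrix = []
    for tokens in tokenized_subjects:
        row = [0] * len(feature_words)
        for token in tokens:
            j = index.get(token)
            if j is not None:
                row[j] += 1
        matrix.append(row)
    return matrix
```

Each row of the result plays the role of one vector $X_i$; the whole matrix would then be passed to a topic model with T topics, whose per-email topic vectors feed the SVM.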
(5) Extracting the secondary feature based on the email body
(5.1) Email body preprocessing
Segment the email bodies with a Chinese word segmentation system. Select the nouns and verbs from the segmented words by part of speech as candidate feature words, giving the candidate feature word set of the training emails. Then, according to the following formula:
[formula image missing]
compute the chi-square value of each candidate feature word, where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of occurrences of t in class-c emails, B is the number of occurrences of t in non-class-c emails, C is the count of class-c emails in which t does not occur, D is the count of non-class-c emails in which t does not occur, and N is the size of the training set. The G candidate feature words with the largest chi-square values are kept as the feature words for subsequent processing. This filters out feature words that contribute little to classification and so reduces computational complexity.
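The chi-square selection in step (5.1) can be illustrated with the sketch below. The formula is not reproduced in this text, so the code assumes the standard chi-square statistic used for text feature selection, N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), and interprets A, B, C, D as email counts so that A + B + C + D = N. Function names are hypothetical.

```python
def chi_square(A, B, C, D):
    """ASSUMED standard form: N * (A*D - C*B)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0


def select_features(docs, labels, target, G):
    """docs: list of token sets, labels: class label per doc.
    Returns the G words with the largest chi-square value for class `target`."""
    vocab = set().union(*docs)
    scores = {}
    for t in vocab:
        A = sum(1 for d, y in zip(docs, labels) if t in d and y == target)
        B = sum(1 for d, y in zip(docs, labels) if t in d and y != target)
        C = sum(1 for d, y in zip(docs, labels) if t not in d and y == target)
        D = sum(1 for d, y in zip(docs, labels) if t not in d and y != target)
        scores[t] = chi_square(A, B, C, D)
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:G]
```

Words that appear equally often in both classes score zero and are dropped first, which is the filtering effect the step describes.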
(5.2) Preliminary email classification
Using the G feature words from (5.1), compute the tf-idf values of the feature words for the i-th email and vectorize them, giving a new vector $Y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,G})$, $1 \le i \le N$. The vectors $Y_i$ are used as the input to the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than a threshold α, the node is judged to be an unimportant-email node; controlling α in each leaf node keeps the overall recall of important emails at a high level. This trains a preliminary classification model that filters out part of the unimportant emails, and the model divides the emails into two classes, important and unimportant.
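Two pieces of step (5.2) lend themselves to a short sketch: the tf-idf vectorization and the leaf labeling rule with threshold α. The text does not say which tf-idf variant is used, so the code assumes tf = count / email length and idf = log(N / document frequency). The C4.5 tree construction itself is not shown, and the function names are hypothetical.

```python
import math


def tfidf_vectors(tokenized_bodies, feature_words):
    """Return one tf-idf vector per email over the given feature words.
    ASSUMED variant: tf = count / length, idf = log(N / document frequency)."""
    n_docs = len(tokenized_bodies)
    df = {w: sum(1 for toks in tokenized_bodies if w in toks) for w in feature_words}
    vectors = []
    for toks in tokenized_bodies:
        row = []
        for w in feature_words:
            tf = toks.count(w) / len(toks) if toks else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        vectors.append(row)
    return vectors


def label_leaf(n_important, n_total, alpha):
    """Leaf rule from the text: a leaf is an unimportant-email node (0) when
    its share of important emails is below alpha, otherwise important (1)."""
    if n_total == 0:
        return 0
    return 0 if n_important / n_total < alpha else 1
```

A small α means a leaf is marked unimportant only when it contains almost no important emails, which is why the rule favors recall of important emails over filtering strength.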
(5.3) Secondary feature extraction
The preliminary classification model from step (5.2) divides the emails into important and unimportant. For an email judged important, compute for each of the G feature words from step (5.1) the Bayesian probability that it belongs to important email and the Bayesian probability that it belongs to unimportant email, take the ratio of the former to the latter as the feature value of that word, and vectorize. For an email judged unimportant, set the feature values of all G feature words to 0 and vectorize in the same way. This gives a new vector $Z_i = (z_{i,1}, z_{i,2}, \dots, z_{i,G})$, $1 \le i \le N$. With $Z_i$ as the input to the SVM algorithm and the true class labels as the target, build a body-based classification model. The output of this model gives the probability that the i-th email is important, and this probability is taken as the email's body-based secondary feature, denoted $f_{i,3}$.
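The feature values in step (5.3) might be computed along the following lines. The text does not spell out how the Bayesian probabilities are estimated, so this sketch makes several assumptions that are not from the source: probabilities are estimated from email counts with add-one smoothing, the ratio is P(important | word) / P(unimportant | word), and a feature word absent from an email contributes 0. All names are hypothetical.

```python
def word_odds(docs, labels, feature_words, smoothing=1.0):
    """docs: list of token sets; labels: 1 = important, 0 = unimportant.
    Returns, per feature word, the ASSUMED ratio
    P(word | important) P(important) / (P(word | unimportant) P(unimportant))."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in the training data")
    prior_pos, prior_neg = n_pos / len(labels), n_neg / len(labels)
    odds = {}
    for w in feature_words:
        pos_with = sum(1 for d, y in zip(docs, labels) if y == 1 and w in d)
        neg_with = sum(1 for d, y in zip(docs, labels) if y == 0 and w in d)
        p_w_pos = (pos_with + smoothing) / (n_pos + 2 * smoothing)
        p_w_neg = (neg_with + smoothing) / (n_neg + 2 * smoothing)
        odds[w] = (p_w_pos * prior_pos) / (p_w_neg * prior_neg)
    return odds


def body_vector(tokens, judged_important, feature_words, odds):
    """Vector Z_i: all zeros if the preliminary model judged the email
    unimportant, otherwise the odds of each feature word present in it."""
    if not judged_important:
        return [0.0] * len(feature_words)
    present = set(tokens)
    return [odds[w] if w in present else 0.0 for w in feature_words]
```

Smoothing keeps the ratio finite when a word never appears in unimportant emails; a different estimator would change the values but not the structure of the vector.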
(6) Modeling with the secondary features
The secondary features $f_{i,1}, f_{i,2}, f_{i,3}$ obtained in steps (3), (4), and (5) form a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$. With this vector as the input, train a neural network with a single hidden layer of two nodes and an output layer of one node; the network's output lies in the interval [0,1]. If the output is greater than a threshold θ, the email is an important email; otherwise it is an unimportant email.
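The network shape in step (6), three inputs, one hidden layer of two nodes, and one output node, can be sketched as a forward pass. The sigmoid activation is an assumption chosen because it keeps the output in [0,1]; the text does not name the activation or the training procedure, and the weights below are placeholders rather than trained values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(v, hidden_weights, hidden_bias, out_weights, out_bias):
    """v: (f1, f2, f3). hidden_weights: two rows of three weights.
    Returns the network output, a value in [0, 1]."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, v)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)


def is_important(v, params, theta):
    """Decision rule from the text: important if the output exceeds theta."""
    return forward(v, *params) > theta
```

With all weights at zero the output is exactly 0.5, so whether an email is flagged then depends only on where θ sits relative to 0.5; training moves the output toward 1 for important emails and toward 0 otherwise.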
(7) Important email detection
To classify a new email, use steps (3), (4), and (5) to obtain its secondary features based on the addresses, the subject, and the body, then apply the neural network classification model built in step (6) to the email to be identified.
Description of the drawings
Fig. 1 is a flowchart of the multi-level email detection method of the invention, built on information such as the email addresses, email subject, and email body.
Detailed description of the embodiment
A specific embodiment of the invention is described below with reference to the drawing so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.
The emails used in this embodiment include notification emails for international academic conferences, international trade correspondence, inter-enterprise correspondence, advertising emails, and phishing emails circulating on the network. In line with actual conditions, the first three types are treated as important emails and the last two as unimportant emails.
In this embodiment, the email classification method comprises the following steps:
(1) Email preprocessing
Randomly select $N_1$ ($N_1 = 700$) important emails and $N_2$ ($N_2 = 7000$) unimportant emails from the collected emails, N = 7700 emails in total, as the training text, and label each of the N emails "important email" or "unimportant email" (here important emails are labeled 1 and unimportant emails 0). This step corresponds to step ST1 in Fig. 1.
(2) For each email, extract three pieces of information, the email addresses, the email subject, and the email body, by regular expression matching or a string matching algorithm. The matching expression for email addresses is:
Reg_add = "/^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$/";
The email subject is extracted according to the subject tag and content tag that appear in the email; the email body is what remains after the matched address information and subject information are removed. This step is step ST2 in Fig. 1.
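Step (2) can be illustrated with a short sketch. The address expression printed in this text appears to have lost its backslashes, so the pattern below is an assumed restoration (`\w` and `\.` put back, anchors dropped so it can search inside a line). The header names `From:`, `To:`, `Cc:`, and `Subject:` and the function name are assumptions for illustration, not taken from the source.

```python
import re

# ASSUMED restoration of the address expression, without the ^...$ anchors.
ADDRESS_RE = re.compile(r"\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+")


def extract_parts(raw):
    """Split a raw message into (addresses, subject, body) by string matching.

    Lines starting with an address header feed the address list, the subject
    line gives the subject, and everything else is treated as body text.
    """
    addresses, subject, body_lines = [], "", []
    for line in raw.splitlines():
        lower = line.lower()
        if lower.startswith(("from:", "to:", "cc:")):
            addresses += [m.group(0) for m in ADDRESS_RE.finditer(line)]
        elif lower.startswith("subject:"):
            subject = line.split(":", 1)[1].strip()
        else:
            body_lines.append(line)
    return addresses, subject, "\n".join(body_lines).strip()
```

The pattern nests one quantifier inside another, so it can backtrack heavily on long strings that do not contain an address; restricting it to short header lines, as here, avoids that cost.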
(3) Extracting the secondary feature based on email address pairs
(3.1) Denote the set of sender and recipient email addresses of the i-th email as $A_i$; the set of all email addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_{7700}$. Compute $\mathrm{freq}^+(a_h, a_l)$ and $\mathrm{freq}^-(a_h, a_l)$, where $a_h, a_l \in A$ and $a_h, a_l$ are email addresses from the same email. The proportions $p^+(a_h, a_l)$ and $p^-(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs in the important-email address pair set and in the unimportant-email address pair set, respectively, are obtained from the following formula:
[formula image missing]
(3.2) Let [symbol missing] denote the part of the i-th email's address pair set that is contained in the important-email address pair set, expressed as [expression missing], and let [symbol missing] denote the part that is contained in the unimportant-email address pair set, expressed as [expression missing], $1 \le i \le 7700$. Using the following formula:
[formula image missing]
compute the secondary feature of the i-th email based on email addresses, where [symbol missing] is the number of address pairs of the i-th email that are contained in the important-email address pair set, and [symbol missing] is the number that are contained in the unimportant-email address pair set. This step is step ST4 in Fig. 1.
(4) Extracting the secondary feature based on the email subject
(4.1) Segment the subject of every email with a Chinese word segmentation system and select the nouns, verbs, adjectives, and adverbs from the segmented words as feature words, giving F = 205 feature words. This step is step ST5 in Fig. 1.
(4.2) Using the F feature words from step (4.1), count how often each feature word occurs in every email and vectorize the counts, giving N vectors of dimension F, $X_i = (x_{i,1}, x_{i,2}, \dots, x_{i,205})$, $1 \le i \le 7700$, which form the training matrix $(TM)_{7700 \times 205}$. $(TM)_{7700 \times 205}$ is used as the input to the LDA (latent Dirichlet allocation) algorithm to build a topic model with T = 12 topics, which identifies the latent topic information of the emails. The output of the topic model gives N vectors of dimension 12, $X'_i = (x'_{i,1}, x'_{i,2}, \dots, x'_{i,12})$, which form the output matrix $(TM\_SVM)_{7700 \times 12}$. Then $(TM\_SVM)_{7700 \times 12}$ is used as the input to train a subject-based secondary feature extraction model with the SVM (support vector machine) algorithm and a Gaussian kernel. This model extracts the subject-based secondary feature (the probability that the email is important), denoted $f_{i,2}$. This step is step ST6 in Fig. 1.
(5) Extracting the secondary feature based on the email body
(5.1) Email body preprocessing
Segment the email bodies with a Chinese word segmentation system and select the nouns and verbs from the segmented words as candidate feature words, giving the candidate feature word set of all training emails. Then, according to the following formula:
[formula image missing]
compute the chi-square value of each candidate feature word, where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of occurrences of t in class-c emails, B is the number of occurrences of t in non-class-c emails, C is the count of class-c emails in which t does not occur, D is the count of non-class-c emails in which t does not occur, and N = 7700 is the number of emails in the whole training set. The G = 230 candidate feature words with the largest chi-square values are kept as feature words. This filters out feature words that contribute little to classification and so reduces computational complexity. This step is step ST7 in Fig. 1.
(5.2) Email filtering
Using the G = 230 feature words from (5.1), compute the tf-idf values of the feature words for every email and vectorize them, giving a new vector $Y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,230})$, $1 \le i \le 7700$. The vectors $Y_i$ are used as the input to the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than the threshold α, the node is judged to be an unimportant-email node; setting α = 0.03 in each leaf node keeps the overall recall of important emails at a high level. The resulting preliminary classification model divides the emails into two classes, important and unimportant. This step is step ST8 in Fig. 1.
(5.3) Secondary feature extraction
The preliminary classification model from step (5.2) divides the emails into important and unimportant. For an email judged important, compute for each of the G feature words from step (5.1) the Bayesian probability that it belongs to important email and the Bayesian probability that it belongs to unimportant email, take the ratio of the former to the latter as the feature value of that word, and vectorize. For an email judged unimportant, set the feature values of all G feature words to 0 and vectorize in the same way. This gives a new vector $Z_i = (z_{i,1}, z_{i,2}, \dots, z_{i,230})$, $1 \le i \le 7700$. With $Z_i$ as the input to the SVM algorithm and the true class labels as the target, build a body-based classification model. The output of this model gives the probability that the i-th email is important, and this probability is taken as the email's body-based secondary feature, denoted $f_{i,3}$. This step is step ST9 in Fig. 1.
(6) Modeling with the secondary features
The features $f_{i,1}, f_{i,2}, f_{i,3}$ obtained in steps (3), (4), and (5) form a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$, $1 \le i \le 7700$. With this vector as the input, train a neural network with a single hidden layer of two nodes and an output layer of one node; the network's output lies in the interval [0,1]. If the output is greater than θ, the email is judged important; otherwise it is judged unimportant. Based on training analysis, θ is set to 0.53. This step is step ST10 in Fig. 1.
(7) Email prediction
To verify the effectiveness of the multi-level important email detection method, cross-validation is used: from the 7700 previously processed emails, 80% are randomly selected as the training set and the remaining 20% are used as the validation set. This is repeated 100 times. The average accuracy is 86.2% and the average recall is 90.3%.
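The evaluation protocol described here, a random 80/20 split repeated 100 times with averaged metrics, can be sketched as follows. The `train_and_predict` callback stands in for the whole pipeline and is hypothetical, as are the function names; "accuracy" is computed as the share of correct predictions, which is an assumption about what the reported figure measures.

```python
import random


def accuracy_recall(y_true, y_pred):
    """Accuracy = correct / total; recall = found important / all important."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = correct / len(y_true) if y_true else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall


def repeated_holdout(samples, labels, train_and_predict,
                     rounds=100, train_share=0.8, seed=0):
    """Average accuracy and recall over `rounds` random train/validation splits.

    train_and_predict(train_x, train_y, test_x) must return predicted labels.
    """
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    acc_sum = rec_sum = 0.0
    for _ in range(rounds):
        rng.shuffle(idx)
        cut = int(train_share * len(idx))
        train, test = idx[:cut], idx[cut:]
        preds = train_and_predict([samples[i] for i in train],
                                  [labels[i] for i in train],
                                  [samples[i] for i in test])
        acc, rec = accuracy_recall([labels[i] for i in test], preds)
        acc_sum += acc
        rec_sum += rec
    return acc_sum / rounds, rec_sum / rounds
```

With a class ratio of roughly 1:10, recall on the important class is the more informative of the two numbers, since a model that marks everything unimportant would still reach high accuracy.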
These results show that, when the ratio of important to unimportant emails is imbalanced, the method improves on other email classification models by 15%-20%. This indicates that the invention has good application value in fields such as important email identification.
Although an illustrative embodiment of the invention has been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of that embodiment. To those of ordinary skill in the art, variations are apparent as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all innovations that make use of the inventive concept are protected.

Claims (1)

1. A multi-level important email detection method, characterized by comprising the following steps:
(1) Email preprocessing
From the collected emails, randomly select N emails and label each one "important email" or "unimportant email" according to its actual importance;
(2) For each selected email, extract three pieces of information, the email addresses, the email subject, and the email body, using a regular expression matching algorithm or a string matching method;
(3) Extracting the secondary feature based on email addresses
(3.1) Denote the set of sender and recipient email addresses of the i-th email as $A_i$; the set of all email addresses in the N emails is then $A = A_1 \cup A_2 \cup \dots \cup A_N$. Let $\mathrm{freq}^+(a_h, a_l)$ denote the total number of times the email address pair $(a_h, a_l)$ occurs in the set of address pairs formed from important emails, and let $\mathrm{freq}^-(a_h, a_l)$ denote the total number of times it occurs in the set of address pairs formed from unimportant emails, where $a_h, a_l \in A$ and the addresses $a_h, a_l$ come from the same email; the proportions $p^+(a_h, a_l)$ and $p^-(a_h, a_l)$ with which the pair $(a_h, a_l)$ occurs among the important-email address pairs and among the unimportant-email address pairs, respectively, are taken as the secondary features based on email addresses, wherein
[formula image missing]
(3.2) Let [symbol missing] denote the part of the i-th email's address pair set that is contained in the important-email address pair set, expressed as [expression missing], and let [symbol missing] denote the part that is contained in the unimportant-email address pair set, expressed as [expression missing]; the secondary feature $f_{i,1}$ of the i-th email based on email address pairs can then be computed as:
[formula image missing]
where [symbol missing] is the number of address pairs of the i-th email that are contained in the important-email address pair set, and [symbol missing] is the number that are contained in the unimportant-email address pair set;
(4) Extracting the secondary feature based on the email subject
(4.1) Segment the email subjects with a Chinese word segmentation system and select the nouns, verbs, adjectives, and adverbs from the set of segmented words as feature words, giving F feature words; count how often each of the F feature words occurs in every email and vectorize the counts; the resulting N vectors of dimension F form the training matrix $(TM)_{N \times F}$, which is used as the input to the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the hidden topic information of the emails; the N vectors of dimension T output by the trained LDA topic model, where T is the number of topics, are used as the input to an SVM (support vector machine), with the email class labels as the target, to train a subject-based secondary feature extraction model with the SVM algorithm; this model extracts the secondary feature of the i-th email, denoted $f_{i,2}$;
(5) Extracting the secondary feature based on the email body
Segment the email bodies with a Chinese word segmentation system, compute the chi-square value of each segmented word, and select the G words with the largest chi-square values as feature words; compute the tf-idf values of these G feature words for every email and vectorize them; use the resulting vectors as the input to the C4.5 decision tree algorithm; if the proportion of important emails in a leaf node is less than a threshold α, the node is judged to be an unimportant-email node; this trains a filtering model that filters out part of the unimportant emails; the preliminary classification model built with the C4.5 algorithm divides the emails into important and unimportant; for an email judged important, compute the Bayesian probabilities that each of the G feature words belongs to important email and to unimportant email, take the ratio of the probability of belonging to important email to the probability of belonging to unimportant email as the feature value of the corresponding word, and vectorize; for an email judged unimportant, set the feature values of all G feature words to 0 and vectorize; use the resulting vectors as the input to the SVM algorithm, with the true class labels as the target, to build a body-based secondary feature extraction model; this model extracts the body-based secondary feature of the i-th email, denoted $f_{i,3}$;
(6) Modeling with the secondary features
The features $f_{i,1}, f_{i,2}, f_{i,3}$ obtained in steps (3), (4), and (5) form a new vector $V_i = (f_{i,1}, f_{i,2}, f_{i,3})$; with this vector as the input to a neural network algorithm, train a neural network with a single hidden layer of two nodes and an output layer of one node, and judge whether an email is important from the value output by the output layer.
CN201510752497.7A 2015-11-09 2015-11-09 A multi-level important email detection method Active CN105447505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510752497.7A CN105447505B (en) 2015-11-09 2015-11-09 A multi-level important email detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510752497.7A CN105447505B (en) 2015-11-09 2015-11-09 A multi-level important email detection method

Publications (2)

Publication Number Publication Date
CN105447505A (en) 2016-03-30
CN105447505B (en) 2018-12-18

Family

ID=55557664

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201510752497.7A Active CN105447505B (en) 2015-11-09 2015-11-09 A kind of multi-level important email detection method

Country Status (1)

Country Link
CN (1) CN105447505B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955951B (en) * 2016-04-29 2018-12-11 中山大学 A kind of method and device of message screening
CN107528763A (en) * 2016-06-22 2017-12-29 北京易讯通信息技术股份有限公司 A kind of Mail Contents analysis method based on Spark and YARN
CN106357508A (en) * 2016-08-31 2017-01-25 成都启力慧源科技有限公司 Email classification method based on user behavior relationships
CN106453033B (en) * 2016-08-31 2019-03-15 电子科技大学 Multi-level process for sorting mailings based on Mail Contents
CN106372237A (en) * 2016-09-13 2017-02-01 新浪(上海)企业管理有限公司 Fraudulent mail identification method and device
CN107391565B (en) * 2017-06-13 2020-11-03 东南大学 Matching method of cross-language hierarchical classification system based on topic model
CN109543050B (en) * 2018-11-29 2021-08-27 北京航空航天大学 Mail importance evaluation method based on session network
CN109800852A (en) * 2018-11-29 2019-05-24 电子科技大学 A kind of multi-modal spam filtering method
CN109635254A (en) * 2018-12-03 2019-04-16 重庆大学 Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model
CN109800433B (en) * 2019-01-24 2023-11-10 深圳市小满科技有限公司 Filing method and device based on mail two-class model, electronic equipment and medium
CN109902236B (en) * 2019-03-07 2021-06-11 成都数之联科技有限公司 Junk web page degradation method based on non-probability model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6778941B1 (en) * 2000-11-14 2004-08-17 Qualia Computing, Inc. Message and user attributes in a message filtering method and system
CN1790405A (en) * 2005-12-31 2006-06-21 钱德沛 Content classification and authentication algorithm based on Bayesian classification for unsolicited Chinese email
CN101227435A (en) * 2008-01-28 2008-07-23 浙江大学 Method for filtering Chinese junk mail based on Logistic regression
CN101345720A (en) * 2008-08-15 2009-01-14 浙江大学 Junk mail classification method based on partial match estimation
CN102024045A (en) * 2010-12-14 2011-04-20 成都市华为赛门铁克科技有限公司 Information classification processing method, device and terminal

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: The inventor has waived the right to be mentioned

Inventor before: The inventor has waived the right to be mentioned

GR01 Patent grant
CP03 Change of name, title or address

Address after: 610041 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan

Patentee after: Chengdu shuzhilian Technology Co.,Ltd.

Address before: No.2, floor 4, building 1, Jule road crossing, Section 1, West 1st ring road, Wuhou District, Chengdu City, Sichuan Province 610041

Patentee before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd.
