CN105447505B

CN105447505B - A multi-level important email detection method

Info

Publication number: CN105447505B
Application number: CN201510752497.7A
Authority: CN
Inventors: Inventor not disclosed
Original assignee: Chengdu Shuzhilian Technology Co Ltd
Current assignee: Chengdu Shuzhilian Technology Co Ltd
Priority date: 2015-11-09
Filing date: 2015-11-09
Publication date: 2018-12-18
Anticipated expiration: 2035-11-09
Also published as: CN105447505A

Abstract

The invention discloses a multi-level important email detection method built on information such as the email address, email subject, and email body. The method first uses a Bayesian approach to extract a secondary feature of each email based on its email addresses; it then extracts a secondary feature based on the email subject using the LDA (latent Dirichlet allocation) and SVM (support vector machine) algorithms; next, it extracts a secondary feature based on the email body using the C4.5 and SVM algorithms; finally, it trains a neural network model on these three secondary features (address-based, subject-based, and body-based). Using this model to detect email importance yields high accuracy and recall.

Description

A multi-level important email detection method

Technical field

The invention belongs to the technical field of email detection and, more specifically, relates to a multi-level important email detection method suitable for applications such as important email detection and spam filtering.

Background art

With the rapid development of Internet technology, communication over the Internet has become increasingly frequent, and communicating by email has become an indispensable part of life, work, and study. However, while email is becoming an indispensable medium for exchanging important information, it has also become a commercial tool, so users must spend a great deal of time picking out the important emails they need from the large volume they receive. Some email detection algorithms exist to address this problem, but they are relatively simple, so their detection results are not accurate enough; they struggle to meet application requirements, particularly when important emails make up a small share of all emails. Improving the accuracy of important email detection, especially when the share of important emails is small, is therefore a topic of active research.

Existing solutions include probability-based methods, statistical-learning-based methods, and similarity-based clustering methods. Probability-based methods, such as the classical Bayesian method, compute the conditional probability of each class given a set of attribute values and take the class with the largest conditional probability as the classification result; their drawback is that the preconditions they rely on are generally not satisfied. Statistical-learning-based methods include SVM and decision trees. SVM is currently one of the better email classification methods: it maps email attributes into a higher-dimensional space through a kernel function, constructs a maximum-margin hyperplane in that space, and determines the class of an email from the side of the hyperplane on which it falls. Its drawback is that kernel selection is somewhat blind and lacks effective guidance, so it is difficult to choose the optimal kernel function for a particular problem. The decision tree is a relatively efficient method: attribute values are first discretized, and the tree is then grown on the discretized values, level by level, until a branch meets a predetermined requirement or contains a single email. Its drawback is that it overfits easily. Similarity-based clustering methods, such as KNN, compute the distances between emails and assign an email to the class whose samples are nearest to it. Their drawback is that the distances between emails must be computed, so classification efficiency is low.

Each of these methods has its own advantages and disadvantages. When high accuracy is required and the ratio of important to unimportant emails is highly imbalanced, these methods cannot meet the requirements of practical applications.

Summary of the invention

In view of the shortcomings and defects of the prior art, the present invention provides a multi-level email detection method built on information such as the email address, email subject, and email body. For each kind of information (address, subject, body), the method builds a secondary feature extraction model and uses it to obtain a secondary feature; the secondary features are then used as the input for training a neural network model. The invention combines Bayesian methods, LDA (latent Dirichlet allocation), SVM (support vector machine), decision trees, and other methods, and achieves good results in detecting important emails.

The specific steps of the present invention are as follows:

(1) Email preprocessing

From the collected emails, randomly select a total of N emails, with important and unimportant emails in a certain proportion, and label each one as "important email" or "unimportant email" according to its actual importance.

(2) For each selected email, extract three parts of information, namely the email addresses, the email subject, and the email body, using a regular expression matching algorithm or a string matching algorithm.

(3) Extract the secondary feature based on email address pairs

(3.1) Denote the set of sender and recipient addresses of the i-th email by A_i; the set of all addresses in the N emails is then A=A₁∪A₂∪...∪A_N. Let freq⁺(a_h,a_l) denote the number of times the address pair (a_h,a_l) occurs among the address pairs of emails labeled important, and let freq⁻(a_h,a_l) denote the number of times it occurs among the address pairs of emails labeled unimportant, where a_h,a_l∈A and a_h and a_l come from the same email. From these counts, the ratios p⁺(a_h,a_l) and p⁻(a_h,a_l) with which the pair (a_h,a_l) occurs in the important-email address-pair set and in the unimportant-email address-pair set are computed by the following formula:

[formula not reproduced in this text]

(3.2) For the i-th email, consider the part of its address-pair set that is contained in the important-email address-pair set and the part that is contained in the unimportant-email address-pair set (the symbols and defining expressions for these two subsets are not reproduced in this text). The secondary feature f_{i,1} of the i-th email based on address pairs is then computed as:

[formula not reproduced in this text]

where the two counts in the formula are, respectively, the number of address pairs of the i-th email contained in the important-email address-pair set and the number contained in the unimportant-email address-pair set.
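The counting in step (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: address pairs are treated as unordered, and because the formula for p⁺ and p⁻ is not shown in this text, the normalization by the total pair count of each class is a guess rather than the patented formula. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def address_pairs(addresses):
    """Return the set of unordered address pairs formed within one email."""
    return {tuple(sorted(pair)) for pair in combinations(set(addresses), 2)}


def count_pairs(emails):
    """emails: list of (addresses, label); label 1 = important, 0 = unimportant."""
    freq_pos, freq_neg = Counter(), Counter()
    for addresses, label in emails:
        target = freq_pos if label == 1 else freq_neg
        target.update(address_pairs(addresses))
    return freq_pos, freq_neg


def pair_ratios(freq_pos, freq_neg):
    """ASSUMPTION: each count is divided by the total pair count of its class."""
    total_pos = sum(freq_pos.values()) or 1
    total_neg = sum(freq_neg.values()) or 1
    p_pos = {pair: n / total_pos for pair, n in freq_pos.items()}
    p_neg = {pair: n / total_neg for pair, n in freq_neg.items()}
    return p_pos, p_neg


def counts_for_email(addresses, freq_pos, freq_neg):
    """Number of this email's pairs seen in important / unimportant emails."""
    pairs = address_pairs(addresses)
    n_pos = sum(1 for pair in pairs if pair in freq_pos)
    n_neg = sum(1 for pair in pairs if pair in freq_neg)
    return n_pos, n_neg


emails = [
    (["a@x.com", "b@x.com"], 1),
    (["a@x.com", "b@x.com", "c@y.com"], 1),
    (["spam@z.com", "a@x.com"], 0),
]
freq_pos, freq_neg = count_pairs(emails)
p_pos, p_neg = pair_ratios(freq_pos, freq_neg)
n_pos, n_neg = counts_for_email(["a@x.com", "b@x.com", "spam@z.com"], freq_pos, freq_neg)
```

The two values returned by `counts_for_email` correspond to the two counts described in step (3.2); how they combine into f_{i,1} depends on the formula that is missing here.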

(4) Extract the secondary feature based on the email subject

(4.1) Segment the email subjects with a Chinese word segmentation system, and select the nouns, verbs, adjectives, and adverbs from the segmentation result as feature words, obtaining F feature words.

(4.2) Using the F feature words from step (4.1), count the frequency of each feature word in the i-th email and vectorize the counts, obtaining N F-dimensional vectors X_i=(x_{i,1},x_{i,2},...,x_{i,F}), 1≤i≤N. These vectors form the training matrix (TM)_{N×F}. First, (TM)_{N×F} is used as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the latent topic information of the emails; the output of the topic model gives N T-dimensional vectors X′_i=(x′_{i,1},x′_{i,2},...,x′_{i,T}), which form the output matrix (TM_SVM)_{N×T}, where T is a preset number of topics. Then, with (TM_SVM)_{N×T} as input and the email labels as targets, an SVM (support vector machine) is trained as a subject-based classification model. The output of this model gives the probability that the i-th email is important, and this probability is taken as the subject-based secondary feature of the email, denoted f_{i,2}.
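As a small illustration of the first part of step (4.2), the sketch below builds the word-frequency matrix that would be passed to LDA. It assumes the subjects have already been segmented into tokens and the feature words already chosen; the LDA and SVM stages themselves are not shown. The function name and sample data are hypothetical.

```python
from collections import Counter


def term_frequency_matrix(tokenized_subjects, feature_words):
    """Return one row per email with the count of each feature word."""
    wanted = set(feature_words)
    matrix = []
    for tokens in tokenized_subjects:
        counts = Counter(token for token in tokens if token in wanted)
        matrix.append([counts[word] for word in feature_words])
    return matrix


feature_words = ["conference", "invoice", "deadline"]
subjects = [
    ["conference", "deadline", "deadline"],
    ["cheap", "invoice"],
]
tm = term_frequency_matrix(subjects, feature_words)
```

Each row of `tm` plays the role of one vector X_i, so with N emails and F feature words the result has the N×F shape described above.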

(5) Extract the secondary feature based on the email body

(5.1) Email body preprocessing

Segment the email body with a Chinese word segmentation system. Select the nouns and verbs from the segmentation result, by part of speech, as candidate feature words, obtaining the candidate feature word set of the training emails. Then compute the chi-square value of each candidate feature word according to the following formula:

[formula not reproduced in this text]

where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of class-c emails in which t occurs, B is the number of non-class-c emails in which t occurs, C is the number of class-c emails in which t does not occur, D is the number of non-class-c emails in which t does not occur, and N is the size of the training set. The G candidate feature words with the largest chi-square values are kept as the feature words for subsequent processing. This filters out feature words that contribute little to classification and so reduces computational complexity.
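The selection in step (5.1) can be sketched as below. The formula itself is not shown in this text, so the code assumes the standard chi-square statistic for a 2×2 contingency table, N(AD−BC)² / ((A+C)(B+D)(A+B)(C+D)), and assumes A, B, C, D count emails rather than word occurrences. Function names and data are hypothetical.

```python
def chi_square(a, b, c, d):
    """Standard 2x2 chi-square statistic (assumed form, see lead-in)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_g_features(docs, labels, g):
    """docs: list of token sets; labels: 1 = important, 0 = unimportant."""
    vocab = set().union(*docs)
    scores = {}
    for t in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y == 1 and t in doc)
        b = sum(1 for doc, y in zip(docs, labels) if y == 0 and t in doc)
        c = sum(1 for doc, y in zip(docs, labels) if y == 1 and t not in doc)
        d = sum(1 for doc, y in zip(docs, labels) if y == 0 and t not in doc)
        scores[t] = chi_square(a, b, c, d)
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:g]


docs = [
    {"meeting", "agenda"},
    {"meeting", "budget"},
    {"sale", "offer"},
    {"sale", "meeting"},
]
labels = [1, 1, 0, 0]
best = top_g_features(docs, labels, 1)
```

In this toy data "sale" separates the two classes perfectly, so it receives the highest score and is the single word kept when G=1.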

(5.2) Preliminary classification of emails

Using the G feature words obtained in (5.1), compute the tf-idf value of each feature word in the i-th email and vectorize, obtaining a new vector Y_i=(y_{i,1},y_{i,2},...,y_{i,G}), 1≤i≤N. The vectors Y_i are used as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than a threshold α, that node is judged to be an unimportant-email node; controlling α at each leaf node keeps the overall recall of important emails at a high level. This trains a preliminary classification model that can filter out part of the unimportant emails, and this model divides the emails into two classes, important and unimportant.
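The tf-idf vectorization that feeds the decision tree can be sketched as follows. The text does not say which tf-idf variant is used, so this sketch assumes tf = count / email length and idf = ln(N / document frequency); both choices, the function name, and the sample data are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_bodies, feature_words):
    """Return one tf-idf vector per email over the given feature words."""
    n_docs = len(tokenized_bodies)
    wanted = set(feature_words)
    doc_freq = Counter()
    for tokens in tokenized_bodies:
        doc_freq.update(set(tokens) & wanted)
    vectors = []
    for tokens in tokenized_bodies:
        counts = Counter(tokens)
        length = len(tokens) or 1
        row = []
        for word in feature_words:
            tf = counts[word] / length
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            row.append(tf * idf)
        vectors.append(row)
    return vectors


bodies = [
    ["paper", "review", "paper"],
    ["discount", "paper"],
    ["discount", "offer"],
]
Y = tfidf_vectors(bodies, ["paper", "discount"])
```

Each row of `Y` corresponds to one vector Y_i and would be passed, together with the labels, to a C4.5-style decision tree learner.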

(5.3) Secondary feature extraction

The preliminary classification model of step (5.2) divides the emails into important and unimportant. For emails judged important, compute for each of the G feature words obtained in step (5.1) the Bayesian probability that it belongs to important emails and the Bayesian probability that it belongs to unimportant emails, take the ratio of the former to the latter as the feature value of that feature word, and vectorize. For emails judged unimportant, directly set the feature values of all G feature words to 0 and vectorize in the same way. This gives new vectors Z_i=(z_{i,1},z_{i,2},...,z_{i,G}), 1≤i≤N. With Z_i as the input of the SVM algorithm and the true class labels of the emails as targets, a body-based classification model is built. The output of this model gives the probability that the i-th email is important, and this probability is taken as the body-based secondary feature of the email, denoted f_{i,3}.
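One way to read the feature construction in step (5.3) is sketched below. Several details are assumptions not stated in the text: the probabilities are estimated from email counts with add-one smoothing, and for an email that passed the filter the ratio is used only for feature words that occur in that email, with 0 elsewhere. Function names and data are hypothetical.

```python
def word_ratios(docs, labels, feature_words, smoothing=1.0):
    """Ratio P(important | w) / P(unimportant | w) for each feature word.

    With smoothed estimates (pos + s) / (pos + neg + 2s) and
    (neg + s) / (pos + neg + 2s), the shared denominator cancels.
    """
    ratios = {}
    for word in feature_words:
        pos = sum(1 for doc, y in zip(docs, labels) if y == 1 and word in doc)
        neg = sum(1 for doc, y in zip(docs, labels) if y == 0 and word in doc)
        ratios[word] = (pos + smoothing) / (neg + smoothing)
    return ratios


def body_feature_vector(doc, kept_by_filter, ratios, feature_words):
    """Build Z_i: all zeros if the preliminary model judged the email unimportant."""
    if not kept_by_filter:
        return [0.0] * len(feature_words)
    # ASSUMPTION: the ratio applies only to feature words present in the email.
    return [ratios[word] if word in doc else 0.0 for word in feature_words]


docs = [{"paper", "review"}, {"paper"}, {"discount"}, {"discount", "paper"}]
labels = [1, 1, 0, 0]
features = ["paper", "discount"]
ratios = word_ratios(docs, labels, features)
z_kept = body_feature_vector({"paper"}, True, ratios, features)
z_filtered = body_feature_vector({"paper"}, False, ratios, features)
```

The resulting vectors would then be used as SVM input with the true labels as targets, as the step describes.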

(6) Modeling with the secondary features

The secondary features f_{i,1}, f_{i,2}, f_{i,3} obtained in steps (3), (4), and (5) form a new vector V_i=(f_{i,1},f_{i,2},f_{i,3}). With this vector as input, train a neural network with a single hidden layer of two nodes and an output layer of one node; the output of the network lies in the interval [0,1]. If the output value is greater than a threshold θ, the email is important; otherwise it is unimportant.

(7) Important email detection

To classify a new email, obtain its address-based, subject-based, and body-based secondary features using steps (3), (4), and (5), and then apply the neural network classification model built in step (6) to the email to be identified.

Description of the drawings

Fig. 1 is a flowchart of the multi-level email detection method of the present invention, built on information such as the email address, email subject, and email body.

Specific embodiment

A specific embodiment of the invention is described below with reference to the accompanying drawing so that those skilled in the art can better understand the invention. Note in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.

The emails used in this embodiment include notification emails for international academic conferences, international trade correspondence, correspondence between enterprises, advertising emails, and phishing emails circulating on the network. In line with actual use, the first three types are treated as important emails and the last two as unimportant emails.

In this embodiment, the email classification method comprises the following steps:

(1) Email preprocessing

From the collected emails, randomly select N₁ (N₁=700) important emails and N₂ (N₂=7000) unimportant emails, N (N=7700) emails in total, as the training text, and label each of the N selected emails as "important email" or "unimportant email" (here important emails are labeled 1 and unimportant emails 0). This step corresponds to step ST1 in Fig. 1.

(2) For each email, extract three parts of information, namely the email addresses, the email subject, and the email body, by regular expression matching or a string matching algorithm. The matching expression for email addresses is:

Reg_add="/^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\w{2,3})+$/";

The email subject is extracted according to the subject tag and content tag that appear in the email; the email body is the information that remains after the matched address information and subject information are removed. This step is step ST2 in Fig. 1.
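A minimal Python sketch of this extraction step follows. It is not the embodiment's implementation: the address pattern is a simplified, unanchored variant chosen so that it can find addresses inside header text, and the assumptions that headers end at the first blank line and that the subject sits on a `Subject:` line are illustrative only.

```python
import re

# Hypothetical, simplified address pattern (unanchored so findall can scan text).
ADDRESS_RE = re.compile(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+")
SUBJECT_RE = re.compile(r"^Subject:\s*(.*)$", flags=re.MULTILINE)


def split_email(raw):
    """Split a raw message into (addresses, subject, body).

    ASSUMPTION: headers and body are separated by the first blank line.
    """
    header, _, body = raw.partition("\n\n")
    addresses = ADDRESS_RE.findall(header)
    match = SUBJECT_RE.search(header)
    subject = match.group(1).strip() if match else ""
    return addresses, subject, body.strip()


raw = (
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Subject: Call for papers\n"
    "\n"
    "Submission deadline is next month."
)
addresses, subject, body = split_email(raw)
```

The three returned values correspond to the three parts of information that the later steps consume separately.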

(3) Extract the secondary feature based on email address pairs

(3.1) Denote the set of sender and recipient addresses of the i-th email by A_i; the set of all addresses in the N emails is then A=A₁∪A₂∪...∪A₇₇₀₀. Compute freq⁺(a_h,a_l) and freq⁻(a_h,a_l), where a_h,a_l∈A and a_h and a_l come from the same email. The ratios p⁺(a_h,a_l) and p⁻(a_h,a_l) with which the address pair (a_h,a_l) occurs in the important-email address-pair set and in the unimportant-email address-pair set are then computed by the following formula:

[formula not reproduced in this text]

(3.2) For the i-th email, 1≤i≤7700, consider the part of its address-pair set contained in the important-email address-pair set and the part contained in the unimportant-email address-pair set (the symbols and defining expressions for these two subsets are not reproduced in this text). The address-based secondary feature of the i-th email is computed with the following formula:

[formula not reproduced in this text]

where the two counts in the formula are, respectively, the number of address pairs of the i-th email contained in the important-email address-pair set and the number contained in the unimportant-email address-pair set. This step is step ST4 in Fig. 1.

(4) Extract the secondary feature based on the email subject

(4.1) Segment the subject of every email with a Chinese word segmentation system, and select nouns, verbs, adjectives, and adverbs from the segmentation result as feature words, obtaining F=205 feature words. This step is step ST5 in Fig. 1.

(4.2) Using the F feature words from step (4.1), count the frequency of each feature word in every email and vectorize, obtaining N F-dimensional vectors X_i=(x_{i,1},x_{i,2},...,x_{i,205}), 1≤i≤7700, which form the training matrix (TM)_{7700×205}. Use (TM)_{7700×205} as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model with T=12 topics, which identifies the latent topic information of the emails. The output of the topic model gives N 12-dimensional vectors X′_i=(x′_{i,1},x′_{i,2},...,x′_{i,12}), which form the output matrix (TM_SVM)_{7700×12}. Then, with (TM_SVM)_{7700×12} as input and a Gaussian kernel function, train an SVM (support vector machine) as the subject-based secondary feature extraction model. This model extracts the subject-based secondary feature (the probability that the email is important), denoted f_{i,2}. This step is step ST6 in Fig. 1.

(5) Extract the secondary feature based on the email body

(5.1) Email body preprocessing

Segment the email body with a Chinese word segmentation system, and select nouns and verbs from the segmentation result as candidate feature words, obtaining the candidate feature word set of all training emails. Then compute the chi-square value of each candidate feature word according to the following formula:

[formula not reproduced in this text]

where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of class-c emails in which t occurs, B is the number of non-class-c emails in which t occurs, C is the number of class-c emails in which t does not occur, D is the number of non-class-c emails in which t does not occur, and N is the number of emails in the whole training set, N=7700. From all candidate feature words, the G=230 with the largest chi-square values are kept as feature words. This filters out feature words that contribute little to classification and so reduces computational complexity. This step is step ST7 in Fig. 1.

(5.2) Email filtering

Using the G=230 feature words obtained in (5.1), compute the tf-idf value of each feature word in every email and vectorize, obtaining a new vector Y_i=(y_{i,1},y_{i,2},...,y_{i,230}), 1≤i≤7700. The vectors Y_i are used as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than the threshold α, that node is judged to be an unimportant-email node; setting α=0.03 at each leaf node keeps the overall recall of important emails at a high level. The resulting preliminary classification model divides the emails into two classes, important and unimportant. This step is step ST8 in Fig. 1.
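The leaf-labeling rule with α=0.03 can be illustrated in isolation as below. The sketch assumes the decision tree has already been grown and that the number of important and unimportant training emails in each leaf is known; the leaf identifiers and counts are made up for illustration.

```python
def label_leaves(leaf_counts, alpha=0.03):
    """leaf_counts: {leaf_id: (n_important, n_unimportant)}."""
    decisions = {}
    for leaf, (n_imp, n_unimp) in leaf_counts.items():
        total = n_imp + n_unimp
        share = n_imp / total if total else 0.0
        # A leaf is an unimportant-email node only if important emails are rare in it.
        decisions[leaf] = "unimportant" if share < alpha else "important"
    return decisions


def retained_recall(leaf_counts, decisions):
    """Share of important emails that end up in leaves labeled important."""
    total_imp = sum(n_imp for n_imp, _ in leaf_counts.values())
    kept = sum(n_imp for leaf, (n_imp, _) in leaf_counts.items()
               if decisions[leaf] == "important")
    return kept / total_imp if total_imp else 0.0


leaves = {"leaf_1": (1, 99), "leaf_2": (40, 60), "leaf_3": (5, 95)}
decisions = label_leaves(leaves)
recall = retained_recall(leaves, decisions)
```

Because a leaf is discarded only when important emails make up less than 3% of it, few important emails are lost at this stage, which is why the low threshold favors recall over precision.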

(5.3) Secondary feature extraction

The preliminary classification model of step (5.2) divides the emails into important and unimportant. For emails judged important, compute for each of the G feature words obtained in step (5.1) the Bayesian probability that it belongs to important emails and the Bayesian probability that it belongs to unimportant emails, take the ratio of the former to the latter as the feature value of that feature word, and vectorize. For emails judged unimportant, directly set the feature values of all G feature words to 0 and vectorize in the same way. This gives new vectors Z_i=(z_{i,1},z_{i,2},...,z_{i,230}), 1≤i≤7700. With Z_i as the input of the SVM algorithm and the true class labels of the emails as targets, a body-based classification model is built. The output of this model gives the probability that the i-th email is important, and this probability is taken as the body-based secondary feature of the email, denoted f_{i,3}. This step is step ST9 in Fig. 1.

(6) Modeling with the secondary features

The features f_{i,1}, f_{i,2}, f_{i,3} obtained in steps (3), (4), and (5) form a new vector V_i=(f_{i,1},f_{i,2},f_{i,3}), 1≤i≤7700. With this vector as input, train a neural network with a single hidden layer of two nodes and an output layer of one node; the output value of the network lies in the interval [0,1]. When the output value is greater than θ, the email is judged important; otherwise it is judged unimportant. Training analysis gives θ=0.53. This step is step ST10 in Fig. 1.
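The decision rule of step (6) can be sketched as a forward pass through a network with three inputs, two hidden nodes, and one output node. The sigmoid activation is an assumption (the text only states that the output lies in [0,1]), and the weights below are made-up values for illustration, not trained parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(v, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 3-2-1 network; returns a value in (0, 1)."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, v)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def is_important(v, params, theta=0.53):
    """Apply the threshold theta to the network output."""
    return forward(v, *params) > theta


# Hypothetical parameters: hidden weights, hidden biases, output weights, output bias.
params = (
    [[2.0, 1.5, 1.5], [-1.0, 0.5, 0.5]],
    [-1.0, 0.0],
    [3.0, -1.0],
    -1.0,
)
likely_important = is_important((0.9, 0.8, 0.85), params)
likely_unimportant = is_important((0.0, 0.1, 0.05), params)
```

In practice the parameters would be learned from the vectors V_i and their labels, and θ tuned on the training data as described above.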

(7) Email prediction

To verify the effect of the multi-level important email detection method, cross-validation is used: 80% of the 7700 previously processed emails are randomly selected as the training set and the remaining 20% as the validation set. This is repeated 100 times. The average accuracy is 86.2% and the average recall is 90.3%.
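The evaluation procedure described here (a random 80/20 split repeated 100 times, with averaged accuracy and recall) can be sketched as follows. The `train_and_predict` callback is a hypothetical stand-in for the whole pipeline, and the threshold classifier used in the example is a toy that ignores its training data.

```python
import random


def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all emails and recall over the important class (label 1)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


def repeated_holdout(samples, labels, train_and_predict,
                     rounds=100, train_share=0.8, seed=0):
    """Average accuracy and recall over repeated random train/validation splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    acc_sum = rec_sum = 0.0
    for _ in range(rounds):
        rng.shuffle(indices)
        cut = int(train_share * len(indices))
        train, test = indices[:cut], indices[cut:]
        preds = train_and_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in test],
        )
        acc, rec = accuracy_and_recall([labels[i] for i in test], preds)
        acc_sum += acc
        rec_sum += rec
    return acc_sum / rounds, rec_sum / rounds


scores = [0.9, 0.8, 0.7, 0.95, 0.1, 0.2, 0.3, 0.15, 0.05, 0.25]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy = lambda train_x, train_y, test_x: [1 if x > 0.5 else 0 for x in test_x]
mean_acc, mean_rec = repeated_holdout(scores, labels, toy)
```

With strongly imbalanced classes, a validation split may contain very few important emails, which is one reason to average over many random splits.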

These results show that, when the ratio of important to unimportant emails is imbalanced, the method improves on other email classification models by 15%-20%. This shows that the invention has good application value in fields such as important email identification.

Although an illustrative embodiment of the invention has been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of that embodiment. To those of ordinary skill in the art, variations are obvious as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all innovations that make use of the inventive concept are protected.

Claims

1. A multi-level important email detection method, characterized by comprising the following steps:

(1) Email preprocessing

From the collected emails, randomly select N emails, and label each one as "important email" or "unimportant email" according to its actual importance;

(2) For each selected email, extract three parts of information, namely the email addresses, the email subject, and the email body, using a regular expression matching algorithm or a string matching method;

(3) Extract the secondary feature based on email addresses

(3.1) Denote the set of sender and recipient addresses of the i-th email by A_i; the set of all addresses in the N emails is then A=A₁∪A₂∪...∪A_N. Let freq⁺(a_h,a_l) denote the total number of times the address pair (a_h,a_l) occurs in the address-pair set formed from important emails, and let freq⁻(a_h,a_l) denote the total number of times it occurs in the address-pair set formed from unimportant emails, where a_h,a_l∈A and the addresses a_h and a_l come from the same email; the ratios p⁺(a_h,a_l) and p⁻(a_h,a_l) with which the pair (a_h,a_l) occurs among the important-email address pairs and among the unimportant-email address pairs, respectively, are taken as secondary features based on email addresses, wherein

[formula not reproduced in this text]

(3.2) For the i-th email, consider the part of its address-pair set that is contained in the important-email address-pair set and the part that is contained in the unimportant-email address-pair set (the symbols and defining expressions for these two subsets are not reproduced in this text); the secondary feature f_{i,1} of the i-th email based on address pairs is then computed as:

[formula not reproduced in this text]

where the two counts in the formula are, respectively, the number of address pairs of the i-th email contained in the important-email address-pair set and the number contained in the unimportant-email address-pair set;

(4) Extract the secondary feature based on the email subject

(4.1) Segment the email subjects with a Chinese word segmentation system, and select nouns, verbs, adjectives, and adverbs from the segmentation result as feature words, obtaining F feature words; count the frequency of each of the F feature words in every email and vectorize, so that the resulting N F-dimensional vectors form the training matrix (TM)_{N×F}; use this matrix as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the hidden topic information of the emails; use the N T-dimensional vectors output by the trained LDA topic model, where T is the number of topics, as the input of an SVM (support vector machine); with the email class labels as targets, train a subject-based secondary feature extraction model using the SVM algorithm; this model extracts the secondary feature of the i-th email, denoted f_{i,2};

(5) Extract the secondary feature based on the email body

Segment the email body with a Chinese word segmentation system, compute the chi-square value of each resulting word, and select the G words with the largest chi-square values as feature words; compute the tf-idf value of each of these G feature words for every email and vectorize; use the resulting vectors as the input of the C4.5 decision tree algorithm, and if the proportion of important emails in a leaf node is less than a threshold α, judge that node to be an unimportant-email node, thereby training a filtering model that can filter out part of the unimportant emails; divide the emails into important and unimportant using this preliminary classification model built with the C4.5 algorithm; for emails judged important, compute the Bayesian probabilities that each of the G feature words belongs to important emails and to unimportant emails, take the ratio of the probability of belonging to important emails to the probability of belonging to unimportant emails as the feature value of the corresponding feature word, and vectorize; for emails judged unimportant, set the feature values of all G feature words to 0 and vectorize; use the resulting vectors as the input of the SVM algorithm and, with the true class labels of the emails as targets, build a body-based secondary feature extraction model; this model extracts the body-based secondary feature of the i-th email, denoted f_{i,3};

(6) Modeling with the secondary features

The features f_{i,1}, f_{i,2}, f_{i,3} obtained in steps (3), (4), and (5) form a new vector V_i=(f_{i,1},f_{i,2},f_{i,3}); with this vector as the input of a neural network algorithm, train a neural network with a single hidden layer of two nodes and an output layer of one node, and judge whether an email is important from the magnitude of the value output by the output layer.