CN105447505A - Multilevel important email detection method

Info

Publication number: CN105447505A
Application number: CN201510752497.7A
Authority: CN
Inventors: Not disclosed (inventor name withheld from publication)
Original assignee: Chengdu Shuzhilian Technology Co Ltd
Current assignee: Chengdu Shuzhilian Technology Co Ltd
Priority date: 2015-11-09
Filing date: 2015-11-09
Publication date: 2016-03-30
Anticipated expiration: 2035-11-09
Also published as: CN105447505B

Abstract

The invention discloses a multi-level important email detection method built on information such as the email address, the email subject and the email body. First, a secondary feature based on email address pairs is extracted using a Bayes method. Second, a secondary feature based on the email subject is extracted using latent Dirichlet allocation (LDA) and a support vector machine (SVM). Third, a secondary feature based on the email body is extracted using the C4.5 decision tree algorithm and an SVM. Finally, a neural network model is trained on these three secondary features and is used to detect whether an email is important. The method thereby achieves higher detection accuracy and recall.

Description

A multi-level important email detection method

Technical field

The invention belongs to the technical field of email detection and, more specifically, relates to a multi-level method for detecting important emails, applicable to important email detection, spam filtering and similar applications.

Background art

With the rapid development of Internet technology, communication over the Internet has become increasingly frequent, and email has become an indispensable part of daily life, work and study. However, as email has grown into an essential communication medium, it has also become a commercial channel, so users must spend a great deal of time picking out the important emails they need from the large volume of email they receive. Several email detection algorithms address this problem, but each relies on a single technique, which makes the detection results insufficiently accurate. This is especially true when important emails make up only a small share of all emails, where these algorithms struggle to meet application requirements. Improving the accuracy of important email detection, particularly when important emails are a small minority, is therefore an active research topic.

Existing solutions include probability-based methods, methods based on statistical learning, and methods based on similarity clustering. A typical probability-based method is the classical Bayes method: given a set of attribute values, it computes the conditional probability of each class and takes the class label with the largest conditional probability as the classification result. Its drawback is that its preconditions are generally not satisfied in practice. Methods based on statistical learning include the SVM and the decision tree. The SVM is currently one of the better-performing email classification methods: it maps email attributes into a high-dimensional space through a kernel function, constructs a maximum-margin hyperplane in that space, and decides the class of an email according to the side of the hyperplane on which it falls. Its drawback is that the choice of kernel function is somewhat arbitrary and lacks effective guidance, so it is difficult to select the best kernel function for a particular problem. The decision tree is a relatively efficient method: attribute values are first discretized, and the tree is then grown on the discretized values, branch by branch, until a branch meets a predetermined requirement or contains only a single email. Its drawback is that it overfits easily. A typical method based on similarity clustering is KNN, which computes the distances between emails and assigns an email to the class of the samples closest to it. Its drawback is that the distances between emails must be computed, so classification efficiency is low.

Each of these methods has its own advantages and its own shortcomings. When the required accuracy is high and the ratio of important to unimportant emails is highly imbalanced, these methods cannot meet the requirements of practical applications.

Summary of the invention

To address the shortcomings and defects of the prior art, the invention provides a multi-level email detection method built on information such as the email address, the email subject and the email body. The method builds a separate secondary feature extraction model for each of the address, subject and body information, uses these models to obtain secondary features, and then trains a neural network model with the secondary features as its input. The invention combines Bayes, LDA (latent Dirichlet allocation), SVM (support vector machine) and decision tree methods, and achieves good results in detecting important emails.

The specific steps of the invention are as follows:

(1) Email preprocessing

From the collected emails, a total of N important and unimportant emails are randomly drawn in a certain proportion, and each is labeled "important email" or "unimportant email" according to its actual importance.

(2) For each extracted email, three pieces of information, namely the email addresses, the email subject and the email body, are extracted by a regular expression matching algorithm or a string matching algorithm.

(3) Extracting the secondary feature based on email address pairs

(3.1) Let A_i denote the set of sender and recipient email addresses of the i-th email; the set of all email addresses of the N emails can then be expressed as A = A_1 ∪ A_2 ∪ ... ∪ A_N. Let freq^{+}(a_h, a_l) denote the number of times the email address pair (a_h, a_l) appears among the address pairs of emails labeled important, and let freq^{-}(a_h, a_l) denote the number of times it appears among the address pairs of emails labeled unimportant, where a_h, a_l ∈ A and a_h, a_l come from the same email. The ratios p^{+}(a_h, a_l) and p^{-}(a_h, a_l) with which the pair (a_h, a_l) appears in the set of important-email address pairs and in the set of unimportant-email address pairs are obtained from the following formulas:

p^{+} (a_{h}, a_{l}) = \frac{{freq}^{+} (a_{h}, a_{l})}{N}

p^{-} (a_{h}, a_{l}) = \frac{{freq}^{-} (a_{h}, a_{l})}{N}

(3.2) Let A_{i}^{+} denote the subset of the address pairs formed from the i-th email that is contained in the set of important-email address pairs, and let A_{i}^{-} denote the subset that is contained in the set of unimportant-email address pairs. The secondary feature f_{i,1} of the i-th email based on email address pairs can then be calculated as:

f_{i, 1} = \frac{\sqrt[| A_{i}^{+} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{+}} p^{+} (a_{h}, a_{l})}}{\sqrt[| A_{i}^{+} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{+}} p^{+} (a_{h}, a_{l})} + \sqrt[| A_{i}^{-} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{-}} p^{-} (a_{h}, a_{l})}}

where | A_{i}^{+} | is the number of address pairs of the i-th email that are contained in the set of important-email address pairs, and | A_{i}^{-} | is the number of address pairs of the i-th email that are contained in the set of unimportant-email address pairs.
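The address-pair feature of step (3) can be illustrated with a minimal Python sketch. It is not the implementation of the invention: treating pairs as unordered, counting each pair once per email, and returning a neutral value of 0.5 when an email contains no known pair are assumptions made here, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import exp, log


def address_pairs(addresses):
    """Unordered pairs of distinct addresses found in one email (assumption)."""
    return {tuple(sorted(p)) for p in combinations(set(addresses), 2)}


def fit_pair_ratios(emails, labels):
    """Return (p_plus, p_minus): per-class pair counts divided by N."""
    n = len(emails)
    freq_pos, freq_neg = Counter(), Counter()
    for addresses, label in zip(emails, labels):
        target = freq_pos if label == 1 else freq_neg
        target.update(address_pairs(addresses))
    p_plus = {pair: count / n for pair, count in freq_pos.items()}
    p_minus = {pair: count / n for pair, count in freq_neg.items()}
    return p_plus, p_minus


def geometric_mean(values):
    """|A|-th root of the product of the values; 0.0 for an empty list."""
    return exp(sum(log(v) for v in values) / len(values)) if values else 0.0


def address_feature(addresses, p_plus, p_minus, default=0.5):
    """Secondary feature f_{i,1} of one email."""
    pairs = address_pairs(addresses)
    g_pos = geometric_mean([p_plus[p] for p in pairs if p in p_plus])
    g_neg = geometric_mean([p_minus[p] for p in pairs if p in p_minus])
    if g_pos + g_neg == 0:
        return default  # assumption: no known pair gives a neutral value
    return g_pos / (g_pos + g_neg)
```

Computing the root through logarithms avoids underflow when an email has many address pairs, since the product of many small ratios can become very small.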

(4) Extracting the secondary feature based on the email subject

(4.1) A Chinese word segmentation system is used to segment the email subject, and nouns, verbs, adjectives and adverbs are chosen from the segmented words as feature words, giving F feature words for the emails.

(4.2) Using the F feature words obtained in step (4.1), the frequency with which each of the F feature words occurs in the i-th email is counted and vectorized, giving N vectors of dimension F, X_i = (x_{i,1}, x_{i,2}, ..., x_{i,F}), 1 ≤ i ≤ N. These N vectors form the training matrix (TM)_{N × F}. First, the matrix (TM)_{N × F} is used as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the latent topic information of the emails; the output of the topic model gives N vectors of dimension T, X'_i = (x'_{i,1}, x'_{i,2}, ..., x'_{i,T}), which form the output matrix (TM_SVM)_{N × T}, where T is a number of topics given in advance. Then, with the matrix (TM_SVM)_{N × T} as input and the email labels as targets, the SVM (support vector machine) algorithm is used to train a classification model based on the email subject. The output of this model gives the probability that the i-th email is an important email, and this probability is taken as the secondary feature of the email based on the email subject, denoted f_{i,2}.
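The construction of the word-frequency matrix (TM) in step (4.2) can be sketched as follows. The sketch assumes the subjects have already been segmented and filtered by part of speech, so each subject is a list of feature words; the LDA and SVM stages would normally come from a machine learning library and are not shown. The function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_subjects):
    """Sorted list of the F distinct feature words over all subjects."""
    return sorted({word for words in tokenized_subjects for word in words})


def term_frequency_matrix(tokenized_subjects, vocabulary):
    """One row X_i per email; column j holds the count of vocabulary[j]."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for words in tokenized_subjects:
        row = [0] * len(vocabulary)
        for word, count in Counter(words).items():
            if word in index:  # words outside the vocabulary are ignored
                row[index[word]] = count
        matrix.append(row)
    return matrix
```

Fixing the vocabulary order once at training time matters, because the same column order must be used when an unseen email is vectorized for prediction.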

(5) Extracting the secondary feature based on the email body

(5.1) Email body preprocessing

A Chinese word segmentation system is used to segment the email body. Nouns and verbs are chosen from the segmented words by part of speech as candidate feature words, giving the candidate feature word set of the training emails. The chi-square value of each candidate feature word is then calculated according to the following formula:

χ^{2} (t, c) = \frac{N \times {(A D - C B)}^{2}}{(A + C) \times (B + D) \times (A + B) \times (C + D)}

where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of times t occurs in emails of class c, B is the number of times t occurs in emails not of class c, C is the number of times t does not occur in emails of class c, D is the number of times t does not occur in emails not of class c, and N is the size of the training set. The G candidate feature words with the largest chi-square values are taken as the feature words for subsequent processing. This filters out feature words that contribute little to classification and so reduces the computational complexity.
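The chi-square selection of step (5.1) can be sketched in a few lines of Python. The sketch assumes that A, B, C and D are counted per email (an email either contains the word or does not), so that A + B + C + D = N; the text does not state this explicitly. Each email is represented as a set of candidate words, and the function names are hypothetical.

```python
def chi_square(n, a, b, c, d):
    """Chi-square statistic of a word/class pair from the counts A, B, C, D."""
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, target=1, top_g=2):
    """Rank candidate words by chi-square against class `target`; keep top G."""
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    scores = {}
    for word in set().union(*docs):
        a = sum(1 for doc, y in zip(docs, labels) if word in doc and y == target)
        b = sum(1 for doc, y in zip(docs, labels) if word in doc and y != target)
        c = n_target - a        # emails of the class that lack the word
        d = (n - n_target) - b  # emails of other classes that lack the word
        scores[word] = chi_square(n, a, b, c, d)
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_g], scores
```

A word that occurs equally often in both classes gets a score of 0 and is dropped first, which is the filtering effect the step relies on.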

(5.2) Preliminary email classification

Using the G feature words obtained in (5.1), the tf-idf values of the feature words in the i-th email are calculated and vectorized, giving a new vector Y_i = (y_{i,1}, y_{i,2}, ..., y_{i,G}), 1 ≤ i ≤ N. The vectors Y_i are used as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than a threshold α, the node is judged to be an unimportant-email node; controlling the threshold α in each leaf node keeps the overall recall of important emails at a high level. In this way a preliminary classification model that filters out part of the unimportant emails is trained, and the model divides the emails into two classes, important and unimportant.
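The tf-idf vectorization at the start of step (5.2) can be sketched as below. The text does not say which tf-idf variant is used; this sketch assumes tf is the word count divided by the email length and idf is ln(N / df), where df is the number of emails containing the word. The function name is hypothetical, and the C4.5 tree itself is not shown.

```python
from collections import Counter
from math import log


def tfidf_vectors(tokenized_bodies, feature_words):
    """One vector Y_i per email over the G selected feature words."""
    n = len(tokenized_bodies)
    # Document frequency: number of emails that contain each feature word.
    df = {w: sum(1 for doc in tokenized_bodies if w in doc) for w in feature_words}
    vectors = []
    for doc in tokenized_bodies:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty bodies
        vectors.append([
            (counts[w] / length) * log(n / df[w]) if df[w] else 0.0
            for w in feature_words
        ])
    return vectors
```

Under this variant a word that appears in every email gets a weight of 0, so only words that discriminate between emails contribute to Y_i.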

(5.3) Extracting the secondary feature

The preliminary classification model of step (5.2) divides the emails into important and unimportant emails. For an email judged important, the Bayesian probabilities that each of the G feature words obtained in step (5.1) belongs to important email and to unimportant email are calculated, and the ratio of the former to the latter is taken as the feature value of the corresponding feature word and vectorized. For an email judged unimportant, the feature values of all G feature words are set directly to 0 and vectorized in the same way. This gives a new vector Z_i = (z_{i,1}, z_{i,2}, ..., z_{i,G}), 1 ≤ i ≤ N. With Z_i as the input of the SVM algorithm and the true class labels of the emails as targets, a classification model based on the email body is built. The output of this model gives the probability that the i-th email is an important email, and this probability is taken as the secondary feature of the email based on the email body, denoted f_{i,3}.
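One possible reading of the vector Z_i in step (5.3) is sketched below. Several details are assumptions, because the text leaves them open: the probabilities are estimated from per-email counts, add-one smoothing is applied so the ratio is always defined, and a feature word that is absent from an email gets the value 0. The function names are hypothetical, and the SVM stage is not shown.

```python
def word_ratios(docs, labels, feature_words, smoothing=1.0):
    """ratio[w] approximates P(important | w) / P(unimportant | w)."""
    ratios = {}
    for w in feature_words:
        pos = sum(1 for doc, y in zip(docs, labels) if w in doc and y == 1)
        neg = sum(1 for doc, y in zip(docs, labels) if w in doc and y == 0)
        # The shared normaliser count(w) cancels in the ratio.
        ratios[w] = (pos + smoothing) / (neg + smoothing)
    return ratios


def body_vector(doc, judged_important, ratios, feature_words):
    """Vector Z_i: ratios for emails that passed the filter, zeros otherwise."""
    if not judged_important:
        return [0.0] * len(feature_words)
    return [ratios[w] if w in doc else 0.0 for w in feature_words]
```

Setting the whole vector to zero for filtered emails lets the later SVM separate them trivially, so its capacity is spent on the emails the decision tree could not rule out.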

(6) Modeling with the secondary features

The secondary features f_{i,1}, f_{i,2}, f_{i,3} obtained in steps (3), (4) and (5) form a new vector V_i = (f_{i,1}, f_{i,2}, f_{i,3}). This vector is used as the input to train a neural network with a single hidden layer of two nodes and an output layer of one node, whose output lies in the interval [0, 1]. If the output value is greater than a threshold θ, the email is an important email; otherwise it is an unimportant email.
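The decision stage of step (6) can be sketched as the forward pass of a network with three inputs, two hidden nodes and one output node. The sigmoid activation is an assumption (the text only states that the output lies in [0, 1]), the weights in the usage example are made-up illustrative values rather than trained ones, and the function names are hypothetical. Training is not shown.

```python
from math import exp


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def forward(v, w_hidden, b_hidden, w_out, b_out):
    """Output of a 3-2-1 network for the feature vector V_i = (f1, f2, f3)."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, v)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def is_important(v, params, theta):
    """Apply the threshold rule: important if the output exceeds theta."""
    return forward(v, *params) > theta


# Illustrative (untrained) parameters: two hidden nodes, one output node.
example_params = ([[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]], [-3.0, -3.0], [3.0, 3.0], -3.0)
```

With these example parameters, an email whose three secondary features are all close to 1 is flagged as important, while one whose features are all 0 is not.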

(7) Important email detection

To predict an email, steps (3), (4) and (5) above are applied to obtain its secondary features based on the addresses, the subject and the body, and the neural network classification model built in step (6) is then used to classify the email to be identified.

Brief description of the drawings

Fig. 1 is a flowchart of the multi-level email detection method of the invention, built on information such as the email addresses, the email subject and the email body.

Embodiment

Specific embodiments of the invention are described below with reference to the accompanying drawing so that those skilled in the art can better understand the invention. It should be noted in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.

The emails used in this embodiment include notification emails of international academic conferences, international trade correspondence, correspondence between enterprises, and advertising emails and phishing emails circulating on the network. Based on actual conditions, the first three types are treated as important emails and the last two as unimportant emails.

In this embodiment, the email classification method comprises the following steps:

(1) Email preprocessing

From the collected emails, N_1 (N_1 = 700) important emails and N_2 (N_2 = 7000) unimportant emails, N (N = 7700) emails in total, are randomly drawn as training text. Each of the N emails is labeled "important email" or "unimportant email" (here important emails are labeled 1 and unimportant emails are labeled 0). This step corresponds to step ST1 in Fig. 1.

(2) For each email, the email addresses, the email subject and the email body are extracted by regular expression matching or a string matching algorithm. The matching expression for email addresses is:

Reg_add = "/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/";

The email subject is extracted according to the subject marker and content marker that appear in the email; the email body is the information that remains after the matched address information and subject information are removed. This step is step ST2 in Fig. 1.
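Step (2) can be sketched in Python as follows. The sketch assumes that the address pattern has the common form with an `@` between the local part and the domain, that the raw email separates headers from the body with a blank line, and that the addresses sit in `From`, `To` and `Cc` header lines; none of these details is fixed by the text. The names `TOKEN_RE` and `extract_parts` are hypothetical.

```python
import re

# Address pattern in the common "local@domain.tld" form (assumed).
ADDRESS_RE = re.compile(r"^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$")
# Loose pattern used only to cut candidate tokens out of a header line.
TOKEN_RE = re.compile(r"[\w.-]+@[\w.-]+")


def extract_parts(raw):
    """Split a raw email into (addresses, subject, body)."""
    header, _, body = raw.partition("\n\n")
    addresses, subject = [], ""
    for line in header.splitlines():
        name, _, value = line.partition(":")
        key = name.strip().lower()
        if key in ("from", "to", "cc"):
            for token in TOKEN_RE.findall(value):
                if ADDRESS_RE.match(token):  # keep only well-formed addresses
                    addresses.append(token)
        elif key == "subject":
            subject = value.strip()
    return addresses, subject, body.strip()
```

Cutting candidate tokens first and validating each one against the anchored pattern keeps the strict expression usable on header lines that list several addresses.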

(3) Extracting the secondary feature based on email address pairs

(3.1) Let A_i denote the set of sender and recipient email addresses of the i-th email; the set of all email addresses of the N emails can then be expressed as A = A_1 ∪ A_2 ∪ ... ∪ A_7700. The counts freq^{+}(a_h, a_l) and freq^{-}(a_h, a_l) are calculated, where a_h, a_l ∈ A and a_h, a_l come from the same email. The ratios p^{+}(a_h, a_l) and p^{-}(a_h, a_l) with which the email address pair (a_h, a_l) appears in the set of important-email address pairs and in the set of unimportant-email address pairs are obtained from the following formulas:

p^{+} (a_{h}, a_{l}) = \frac{{freq}^{+} (a_{h}, a_{l})}{N}

p^{-} (a_{h}, a_{l}) = \frac{{freq}^{-} (a_{h}, a_{l})}{N}

(3.2) Let A_{i}^{+} denote the subset of the address pairs formed from the i-th email that is contained in the set of important-email address pairs, and let A_{i}^{-} denote the subset that is contained in the set of unimportant-email address pairs, 1 ≤ i ≤ 7700. The secondary feature of the i-th email based on email address pairs is calculated with the following formula:

f_{i, 1} = \frac{\sqrt[| A_{i}^{+} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{+}} p^{+} (a_{h}, a_{l})}}{\sqrt[| A_{i}^{+} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{+}} p^{+} (a_{h}, a_{l})} + \sqrt[| A_{i}^{-} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{-}} p^{-} (a_{h}, a_{l})}}

where | A_{i}^{+} | is the number of address pairs of the i-th email that are contained in the set of important-email address pairs, and | A_{i}^{-} | is the number of address pairs of the i-th email that are contained in the set of unimportant-email address pairs. This step is step ST4 in Fig. 1.

(4) Extracting the secondary feature based on the email subject

(4.1) A Chinese word segmentation system is used to segment the subject of each email, and nouns, verbs, adjectives and adverbs are chosen from the segmented words as feature words, giving F = 205 feature words for the emails. This step is step ST5 in Fig. 1.

(4.2) Using the F feature words obtained in step (4.1), the frequency with which each of the F feature words occurs in each email is counted and vectorized, giving N vectors of dimension F, X_i = (x_{i,1}, x_{i,2}, ..., x_{i,205}), 1 ≤ i ≤ 7700, which form the training matrix (TM)_{7700 × 205}. The matrix (TM)_{7700 × 205} is used as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model with T = 12 topics, which identifies the latent topic information of the emails. The output of the topic model gives N vectors of dimension 12, X'_i = (x'_{i,1}, x'_{i,2}, ..., x'_{i,12}), which form the output matrix (TM_SVM)_{7700 × 12}. Then, with the matrix (TM_SVM)_{7700 × 12} as input and a Gaussian kernel function, the SVM (support vector machine) algorithm is used to train the secondary feature extraction model based on the email subject. This model extracts the secondary feature based on the email subject (the feature is the probability that the email is an important email), denoted f_{i,2}. This step is step ST6 in Fig. 1.

(5) Extracting the secondary feature based on the email body

(5.1) Email body preprocessing

A Chinese word segmentation system is used to segment the email body, and nouns and verbs are chosen from the segmented words as candidate feature words, giving the candidate feature word set of all training emails. The chi-square value of each candidate feature word is then calculated according to the following formula:

χ^{2} (t, c) = \frac{N \times {(A D - C B)}^{2}}{(A + C) \times (B + D) \times (A + B) \times (C + D)}

where t is a candidate feature word, c is a class (here only important and unimportant), A is the number of times t occurs in emails of class c, B is the number of times t occurs in emails not of class c, C is the number of times t does not occur in emails of class c, D is the number of times t does not occur in emails not of class c, and N = 7700 is the number of emails in the whole training set. The G = 230 candidate feature words with the largest chi-square values are taken as the feature words. This filters out feature words that contribute little to classification and so reduces the computational complexity. This step is step ST7 in Fig. 1.

(5.2) Email filtering

Using the G = 230 feature words obtained in (5.1), the tf-idf values of the feature words in each email are calculated and vectorized, giving a new vector Y_i = (y_{i,1}, y_{i,2}, ..., y_{i,230}), 1 ≤ i ≤ 7700. The vectors Y_i are used as the input of the C4.5 decision tree algorithm. If the proportion of important emails in a leaf node is less than the threshold α, the node is judged to be an unimportant-email node; setting the threshold to α = 0.03 in each leaf node keeps the overall recall of important emails at a high level. The resulting preliminary classification model divides the emails into two classes, important and unimportant. This step is step ST8 in Fig. 1.
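The leaf-labeling rule of step (5.2) can be illustrated independently of the tree-growing algorithm. The sketch below assumes a C4.5 tree has already been grown and that, for each leaf, the numbers of important and unimportant training emails are known; the data structure and function names are hypothetical.

```python
def label_leaves(leaf_counts, alpha=0.03):
    """Map each leaf to 0 (unimportant) or 1 (important).

    leaf_counts: {leaf_id: (n_important, n_unimportant)} from the training set.
    A leaf is labeled unimportant only if its share of important emails
    is below alpha.
    """
    labels = {}
    for leaf, (pos, neg) in leaf_counts.items():
        total = pos + neg
        share = pos / total if total else 0.0
        labels[leaf] = 0 if share < alpha else 1
    return labels


def important_recall(leaf_counts, labels):
    """Share of important training emails that land in leaves labeled 1."""
    kept = sum(pos for leaf, (pos, _) in leaf_counts.items() if labels[leaf] == 1)
    total = sum(pos for pos, _ in leaf_counts.values())
    return kept / total if total else 0.0
```

Because a leaf is discarded only when important emails are very rare in it, few important emails are lost at this stage, at the cost of letting many unimportant emails through to the SVM.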

(5.3) Extracting the secondary feature

The preliminary classification model of step (5.2) divides the emails into important and unimportant emails. For an email judged important, the Bayesian probabilities that each of the G feature words obtained in step (5.1) belongs to important email and to unimportant email are calculated, and the ratio of the former to the latter is taken as the feature value of the corresponding feature word and vectorized. For an email judged unimportant, the feature values of all G feature words are set directly to 0 and vectorized in the same way. This gives a new vector Z_i = (z_{i,1}, z_{i,2}, ..., z_{i,230}), 1 ≤ i ≤ 7700. With Z_i as the input of the SVM algorithm and the true class labels of the emails as targets, a classification model based on the email body is built. The output of this model gives the probability that the i-th email is an important email, and this probability is taken as the secondary feature of the email based on the email body, denoted f_{i,3}. This step is step ST9 in Fig. 1.

(6) Modeling with the secondary features

The features f_{i,1}, f_{i,2}, f_{i,3} obtained in steps (3), (4) and (5) form a new vector V_i = (f_{i,1}, f_{i,2}, f_{i,3}), 1 ≤ i ≤ 7700. This vector is used as the input to train a neural network with a single hidden layer of two nodes and an output layer of one node, whose output value lies in the interval [0, 1]. When the value output by the neural network is greater than θ, the email is judged to be an important email; otherwise it is judged to be an unimportant email. Based on analysis of the training results, θ is set to 0.53. This step is step ST10 in Fig. 1.

(7) Email prediction

To verify the effect of the multi-level important email detection method, cross-validation was used: from the 7700 emails processed above, 80% were randomly drawn as the training set and the remaining 20% were used as the validation set. This procedure was repeated 100 times. The average accuracy was 86.2% and the average recall was 90.3%.
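The repeated 80%/20% evaluation can be sketched as follows. The sketch assumes that "accuracy" means the share of correctly classified validation emails and that recall is measured on the important class (label 1); the callback `train_and_predict` stands in for the whole pipeline of steps (3) to (6) and is hypothetical, as are the other names.

```python
import random


def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all emails and recall over the important class (label 1)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(1 for t in y_true if t == 1)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall


def repeated_holdout(samples, labels, train_and_predict,
                     rounds=100, train_share=0.8, seed=0):
    """Average accuracy and recall over `rounds` random train/validation splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    acc_sum = rec_sum = 0.0
    for _ in range(rounds):
        rng.shuffle(indices)
        cut = int(len(indices) * train_share)
        train, valid = indices[:cut], indices[cut:]
        predictions = train_and_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in valid],
        )
        acc, rec = accuracy_and_recall([labels[i] for i in valid], predictions)
        acc_sum += acc
        rec_sum += rec
    return acc_sum / rounds, rec_sum / rounds
```

With important emails making up roughly one in eleven of the 7700 emails, accuracy alone would look high even for a model that marks everything unimportant, which is why recall on the important class is reported alongside it.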

This result shows that, when the ratio of important to unimportant emails is imbalanced, the method improves on other email classification models by 15%-20%. This demonstrates that the invention has good application value in fields such as important email identification.

Although illustrative embodiments of the invention have been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all innovations that make use of the inventive concept are protected.

Claims

1. A multi-level important email detection method, applicable to specific applications such as important email detection and spam filtering and having high accuracy and recall, characterized in that it comprises the following steps:

(1) Email preprocessing

From the collected emails, N emails are randomly drawn, and each is labeled "important email" or "unimportant email" according to its actual importance;

(2) For each extracted email, three pieces of information, namely the email addresses, the email subject and the email body, are extracted by a regular expression matching algorithm or by string matching;

(3) Extracting the secondary feature based on email addresses

(3.1) Let A_i denote the set of sender and recipient email addresses of the i-th email; the set of all email addresses of the N emails can then be expressed as A = A_1 ∪ A_2 ∪ ... ∪ A_N. Let freq^{+}(a_h, a_l) denote the total number of times the email address pair (a_h, a_l) appears in the address sets formed from important emails, and let freq^{-}(a_h, a_l) denote the total number of times it appears in the address sets formed from unimportant emails, where a_h, a_l ∈ A and the addresses a_h, a_l come from the same email; the ratios p^{+}(a_h, a_l) and p^{-}(a_h, a_l) with which the pair (a_h, a_l) appears among the important-email address pairs and among the unimportant-email address pairs are used for the secondary feature based on email addresses, wherein

p^{+} (a_{h}, a_{l}) = \frac{{freq}^{+} (a_{h}, a_{l})}{N}, p^{-} (a_{h}, a_{l}) = \frac{{freq}^{-} (a_{h}, a_{l})}{N};

(3.2) Let A_{i}^{+} denote the subset of the address pairs formed from the i-th email that is contained in the set of important-email address pairs, and let A_{i}^{-} denote the subset that is contained in the set of unimportant-email address pairs. The secondary feature f_{i,1} of the i-th email based on email address pairs can then be calculated as:

f_{i, 1} = \frac{\sqrt[| A_{i}^{+} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{+}} p^{+} (a_{h}, a_{l})}}{\sqrt[| A_{i}^{+} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{+}} p^{+} (a_{h}, a_{l})} + \sqrt[| A_{i}^{-} |]{\prod_{(a_{h}, a_{l}) \in A_{i}^{-}} p^{-} (a_{h}, a_{l})}}

where | A_{i}^{+} | is the number of address pairs of the i-th email that are contained in the set of important-email address pairs, and | A_{i}^{-} | is the number of address pairs of the i-th email that are contained in the set of unimportant-email address pairs;

(4) Extracting the secondary feature based on the email subject

(4.1) A Chinese word segmentation system is used to segment the email subject, and nouns, verbs, adjectives and adverbs are chosen from the set of segmented words as feature words, giving F feature words for the emails; the frequency with which each of the F feature words occurs in each email is counted and vectorized, and the resulting N vectors of dimension F form the training matrix (TM)_{N × F}, which is used as the input of the LDA (latent Dirichlet allocation) algorithm to build a topic model that identifies the latent topic information of the emails; the N vectors of dimension T (T being the number of topics) output by the trained LDA topic model are used as the input of the SVM, and, with the email class labels as targets, the SVM (support vector machine) algorithm is used to train the secondary feature extraction model based on the email subject; this model extracts the secondary feature of the i-th email, denoted f_{i,2};

(5) Extracting the secondary feature based on the email body

A Chinese word segmentation system is used to segment the email body, the chi-square value of each segmented word is calculated, and the G words with the largest chi-square values are chosen as feature words; the tf-idf values of these G feature words in each email are calculated and vectorized, and the resulting vectors are used as the input of the C4.5 decision tree algorithm; if the proportion of important emails in a leaf node is less than a threshold α, the node is judged to be an unimportant-email node, and controlling the threshold α in each leaf node keeps the overall recall of important emails at a high level; a filtering model that filters out part of the unimportant emails is thereby trained; the preliminary classification model built with the C4.5 algorithm divides the emails into important and unimportant emails; for an email judged important, the Bayesian probabilities that each of the G feature words belongs to important email and to unimportant email are calculated, and the ratio of the probability of belonging to important email to the probability of belonging to unimportant email is taken as the feature value of the corresponding feature word and vectorized; for an email judged unimportant, the feature values of all G feature words are set to 0 and vectorized; the resulting vectors are used as the input of the SVM algorithm, and the secondary feature extraction model based on the email body is built with the true class labels of the emails as targets; this model extracts the secondary feature of the i-th email based on the email body, denoted f_{i,3};

(6) Modeling with the secondary features

The features f_{i,1}, f_{i,2}, f_{i,3} obtained in steps (3), (4) and (5) form a new vector V_i = (f_{i,1}, f_{i,2}, f_{i,3}); this vector is used as the input of a neural network algorithm to train a neural network with a single hidden layer of two nodes and an output layer of one node, and whether an email is important is judged from the magnitude of the value output by the output layer.

2. The multi-level important email detection method according to claim 1, characterized in that in step (2) the email is divided into three pieces of information, namely the email addresses, the email subject and the email body, and each piece is then processed separately by a different method to obtain the corresponding secondary feature.

3. The multi-level important email detection method according to claim 1, characterized in that in step (3) a statistical method is applied to the email address pair information of the emails to obtain the secondary feature based on email addresses.

4. The multi-level important email detection method according to claim 1, characterized in that in step (4) LDA and SVM are used to obtain the secondary feature based on the email subject.

5. The multi-level important email detection method according to claim 1, characterized in that in step (5) the chi-square statistic is used for feature selection, the C4.5 algorithm is then used to filter the emails preliminarily, and the SVM algorithm is then used to build a secondary feature extraction model for the email body, giving the secondary feature based on the email body.

6. The multi-level important email detection method according to claim 1, characterized in that in step (6) the secondary features obtained in steps (3), (4) and (5) are used to train a neural network model that classifies the emails.