CN103984703B

CN103984703B - Mail classification method and device

Info

Publication number: CN103984703B
Application number: CN201410163082.1A
Authority: CN
Inventors: 陈玉焓
Original assignee: Sina Technology China Co Ltd
Current assignee: Sina Technology China Co Ltd
Priority date: 2014-04-22
Filing date: 2014-04-22
Publication date: 2017-04-12
Anticipated expiration: 2034-04-22
Also published as: CN103984703A

Abstract

The invention discloses a mail classification method and device. The method comprises: for each mail class, calculating the probability that a mail to be classified belongs to that class and taking it as the probability corresponding to the class; sorting the probabilities calculated for the mail classes, and judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability; if so, assigning the mail to be classified to that class; otherwise, calculating the difference between the maximum probability and the second-ranked probability, and the ratio of that difference to the maximum probability; if the ratio is less than a set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, assigning the mail to be classified to the mail class corresponding to the second-ranked probability. The keywords set for each mail class thus make mail classification more accurate.

Description

Mail classification method and device

Technical field

The present invention relates to the field of the Internet, and more particularly to a mail classification method and device.

Background art

Email transmits information over the network in a store-and-forward manner, and is characterized by fast delivery, a wide range of correspondents and low cost. In the current Internet information age, it is increasingly common for people to exchange information or communicate by email.

Generally, the mailbox of an email user contains many types of mail, such as business news, social, order, recruitment, training institution and bank financing mail, as well as ordinary dialogue mail (for example, greetings exchanged between friends). If the user's inbox contains too much mail such as business promotions, user complaints increase; and if mail is delivered to the inbox indiscriminately, the various types of mail are mixed together, which makes it difficult for the user to find and read the mail they need. Mail systems therefore often classify mail into multiple classes so that users get a better mailbox experience. For example, besides the ordinary inbox, Gmail has classes such as advertising mail and website update mail, and QQ Mail has subscription mail.

At present, an existing mail classification method is mainly based on a clustering algorithm: according to the feature words obtained by segmenting the mail data of the training sample mails into words, the training sample mails are divided into several mail classes, which respectively form the mail data sample sets of those classes; afterwards, according to the feature words of the mail data of the mail to be classified, the probability that the mail to be classified belongs to the mail data sample set of each mail class is calculated, the mail class corresponding to the maximum probability is taken as the class of the mail to be classified, and the mail is assigned to the mail data sample set of that class. Here, the mail data is usually the mail content.

However, the inventors of the present invention found that the prior-art mail classification method has low accuracy: some mails are misclassified, so that the user cannot view the required mail in time. For example, a user who is looking for a job may care most about recruitment mail, but the prior-art method may assign recruitment mail to the training institution class, so that the user does not receive the recruitment information in time. As another example, ordinary dialogue mail may be assigned to the business news class, so that the user cannot view the misclassified dialogue mail in time, which causes great inconvenience. It is therefore necessary to provide a mail classification method that classifies mail more accurately.

Summary of the invention

In view of the above defects of the prior art, the present invention provides a mail classification method and device to improve the accuracy of mail classification.

According to one aspect of the present invention, a mail classification method is provided, including:

For each predetermined mail class, calculating, according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class, and taking the calculated probability as the probability corresponding to the mail class;

Sorting the calculated probabilities corresponding to the mail classes, and judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability; if so, assigning the mail to be classified to the mail class corresponding to the maximum probability; otherwise:

Calculating the difference between the maximum probability and the second-ranked probability, and calculating the ratio of the difference to the maximum probability; if the calculated ratio is less than a set rate threshold, and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, assigning the mail to be classified to the mail class corresponding to the second-ranked probability.

Preferably, before calculating the probability that the mail to be classified belongs to the mail class, the method further includes:

Determining the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail class, and calculating the ratio of that number to the total number of feature words of the mail to be classified, as the feature word appearance ratio of the mail to be classified under the mail class; and confirming that the feature word appearance ratio of the mail to be classified under the mail class is greater than a set ratio threshold.

The keywords of a mail class are predetermined as follows:

For each mail class, and for each feature word in the feature lexicon of the mail class, the number of sample mails in the mail class that contain the feature word is counted in advance, and the feature words are sorted in descending order of that number; a set number of top-ranked feature words are taken as the keywords of the mail class.
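The keyword selection just described can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `select_keywords`, the representation of a sample mail as a list of words, and the alphabetical tie-break are assumptions made for the example.

```python
from collections import Counter


def select_keywords(sample_mails, lexicon, top_n):
    """Pick the top_n feature words that occur in the most sample mails.

    sample_mails: list of word lists, one per sample mail of the class (assumed format).
    lexicon: set of feature words of the class.
    """
    mail_counts = Counter()
    for words in sample_mails:
        # Count each feature word at most once per mail (number of mails, not occurrences).
        for word in set(words) & lexicon:
            mail_counts[word] += 1
    # Descending by mail count; ties broken alphabetically (an assumption, for determinism).
    ranked = sorted(mail_counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


recruitment_samples = [
    ["job", "resume", "salary"],
    ["job", "resume"],
    ["job", "benefits"],
]
recruitment_lexicon = {"job", "resume", "salary", "benefits"}
keywords = select_keywords(recruitment_samples, recruitment_lexicon, top_n=2)
```

Here "job" appears in three sample mails and "resume" in two, so these two words become the keywords of the hypothetical recruitment class.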

Preferably, calculating, for each predetermined mail class and according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class specifically includes:

Denote the i-th mail class as C_i and the n feature words of the mail to be classified as F₁, F₂, ..., F_n, and calculate the value of formula 1 below as the probability that the mail to be classified belongs to the i-th mail class:

P(C_i)P(F₁|C_i)P(F₂|C_i)...P(F_n|C_i) (formula 1)

In formula 1,

P(F_k|C_i) = N(F_k, C_i) / N(C_i),  P(C_i) = S_i / S

where k is a natural number from 1 to n; N(F_k, C_i) denotes the number of times feature word F_k appears in the mail data sample set of mail class C_i; N(C_i) denotes the sum of the numbers of times that the feature words in the feature lexicon of mail class C_i appear in the mail data sample set of mail class C_i; S_i denotes the number of sample mails in the mail data sample set of mail class C_i; and S is the total number of sample mails in the mail data sample sets of all mail classes.

The feature lexicon of a mail class is obtained as follows:

For each mail class, the sample mails in the mail data sample set of the mail class are segmented into words, and the number of times each resulting word appears in the mail data sample set of the mail class is counted as the word frequency of that word; after rare words and stop words are removed from the resulting words, the words whose word frequency is greater than a set lower threshold and less than a set upper threshold are determined as candidate words of the mail class; the candidate words whose part-of-speech information matches the part-of-speech information recorded in a part-of-speech information table are determined as the feature words of the mail class, and the feature words of the mail class constitute the feature lexicon of the mail class.

The mail data sample set of each mail class is obtained by partitioning the sample mails with a clustering algorithm, according to the similarity between the feature vectors of the sample mails.

Preferably, the feature words of the mail to be classified specifically include: title feature words extracted from the mail title of the mail to be classified, and content feature words extracted from the mail content of the mail to be classified; and

Calculating, according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class specifically includes:

Calculating, according to the title feature words of the mail to be classified, the probability that the mail title of the mail to be classified belongs to the mail class, and taking that probability as the title probability corresponding to the mail class; and

Calculating, according to the content feature words of the mail to be classified, the probability that the mail content of the mail to be classified belongs to the mail class, and taking that probability as the content probability corresponding to the mail class; and

Sorting the calculated probabilities corresponding to the mail classes, judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability, and, if so, assigning the mail to be classified to the mail class corresponding to the maximum probability, specifically includes:

Sorting the calculated title probabilities corresponding to the mail classes, and, if the title feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum title probability, taking the mail class corresponding to the maximum title probability as the pending mail class for the mail title; and

Sorting the calculated content probabilities corresponding to the mail classes, and, if the content feature words of the mail to be classified include a keyword of the mail class corresponding to the maximum content probability, taking the mail class corresponding to the maximum content probability as the pending mail class for the mail content;

If the pending mail class for the mail title is the same as the pending mail class for the mail content, assigning the mail to be classified to that pending mail class.
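The title/content agreement check described above can be illustrated with a short sketch. The function names, the dictionary-based inputs and the choice to return `None` when no decision is reached are assumptions for the example; the passage above does not say what happens when the two pending classes differ.

```python
def pending_class(probabilities, features, keywords):
    """Return the top-ranked class if the features contain one of its keywords, else None."""
    if not probabilities:
        return None
    top_class = max(probabilities, key=probabilities.get)
    if set(features) & keywords.get(top_class, set()):
        return top_class
    return None


def classify_by_title_and_content(title_probs, content_probs,
                                  title_features, content_features, keywords):
    """Assign a class only when title and content agree on the same pending class."""
    title_class = pending_class(title_probs, title_features, keywords)
    content_class = pending_class(content_probs, content_features, keywords)
    if title_class is not None and title_class == content_class:
        return title_class
    return None  # disagreement or no pending class: not covered by the passage above


# Hypothetical classes, probabilities and keywords, for illustration only.
keywords = {"recruitment": {"resume", "hiring"}, "business": {"sale"}}
agreed = classify_by_title_and_content(
    {"recruitment": 0.6, "business": 0.2},
    {"recruitment": 0.5, "business": 0.4},
    ["hiring", "engineer"],
    ["resume", "salary"],
    keywords,
)
disagreed = classify_by_title_and_content(
    {"recruitment": 0.6, "business": 0.2},
    {"recruitment": 0.1, "business": 0.4},
    ["hiring", "engineer"],
    ["sale"],
    keywords,
)
```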

Preferably, after calculating the difference between the maximum probability and the second-ranked probability and calculating the ratio of the difference to the maximum probability, the method further includes:

If the ratio of the difference to the maximum probability is not less than the set rate threshold, determining the mail to be classified to be dialogue mail;

If the ratio of the difference to the maximum probability is less than the set rate threshold, and the feature words of the mail to be classified do not include any keyword of the mail class corresponding to the second-ranked probability, then:

Taking the ratio of the difference to the maximum probability as a first class probability rate, further calculating the difference between the maximum probability and the third-ranked probability, and taking the ratio of that difference to the maximum probability as a second class probability rate; if the second class probability rate is less than the set rate threshold, and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, assigning the mail to be classified to the mail class corresponding to the third-ranked probability.
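The decision procedure set out in the preceding paragraphs (keyword check on the top-ranked class, then fallback to the second-ranked and third-ranked classes via the class probability rate) can be sketched as below. The names `classify` and `DIALOGUE`, the input dictionaries and the `None` return for cases the text does not cover are assumptions made for this illustration; the sketch also assumes the maximum probability is greater than zero.

```python
DIALOGUE = "dialogue"  # label for ordinary dialogue mail (name chosen for this sketch)


def classify(probabilities, mail_features, keywords, rate_threshold):
    """probabilities: class -> probability; keywords: class -> set of keywords."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    features = set(mail_features)
    top_class, top_p = ranked[0]
    # Rule 1: the top-ranked class wins if the mail contains one of its keywords.
    if features & keywords.get(top_class, set()):
        return top_class
    if len(ranked) < 2:
        return None
    second_class, second_p = ranked[1]
    first_rate = (top_p - second_p) / top_p  # first class probability rate
    if first_rate >= rate_threshold:
        return DIALOGUE
    if features & keywords.get(second_class, set()):
        return second_class
    if len(ranked) < 3:
        return None
    third_class, third_p = ranked[2]
    second_rate = (top_p - third_p) / top_p  # second class probability rate
    if second_rate < rate_threshold and features & keywords.get(third_class, set()):
        return third_class
    return None  # remaining cases are not specified in the passage above


# Hypothetical classes and keywords, for illustration only.
kw = {"recruitment": {"resume"}, "training": {"course"}, "business": {"sale"}}
close = {"recruitment": 0.5, "training": 0.45, "business": 0.42}
far = {"recruitment": 0.5, "training": 0.1, "business": 0.05}
```

With a rate threshold of 0.2, a mail whose feature words are `["course", "salary"]` and whose probabilities are `close` has no recruitment keyword, a first class probability rate of 0.1, and a training keyword, so it falls to the second-ranked class.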

According to another aspect of the present invention, a mail classification device is also provided, including:

A probability calculation module, configured to calculate, for each predetermined mail class and according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class, and to take the calculated probability as the probability corresponding to the mail class;

A sorting module, configured to sort the calculated probabilities corresponding to the mail classes to obtain a sorting result;

A class assignment module, configured to judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability in the sorting result; if so, to assign the mail to be classified to the mail class corresponding to the maximum probability; otherwise, to calculate the difference between the maximum probability and the second-ranked probability in the sorting result, and then the ratio of the difference to the maximum probability; and, if the calculated ratio is less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, to assign the mail to be classified to the mail class corresponding to the second-ranked probability.

Further, the mail classification device also includes:

A feature word appearance ratio pre-judgment module, configured to determine, for each predetermined mail class, the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail class, to calculate the ratio of that number to the total number of feature words of the mail to be classified as the feature word appearance ratio of the mail to be classified under the mail class, and to trigger the probability calculation module when the feature word appearance ratio of the mail to be classified under the mail class is confirmed to be greater than the set ratio threshold.

Preferably, the class assignment module is further configured to: determine the mail to be classified to be dialogue mail if the ratio of the difference to the maximum probability is not less than the set rate threshold; and, if the ratio of the difference to the maximum probability is less than the set rate threshold and the feature words of the mail to be classified do not include any keyword of the mail class corresponding to the second-ranked probability, take the ratio of the difference to the maximum probability as a first class probability rate, further calculate the difference between the maximum probability and the third-ranked probability in the sorting result, take the ratio of that difference to the maximum probability as a second class probability rate, and, when the second class probability rate is less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, assign the mail to be classified to the mail class corresponding to the third-ranked probability.

In the technical solution of the present invention, keywords are set for each mail class, and mail is classified by combining the probabilities that the mail to be classified belongs to each mail class with the keywords of the mail classes, which prevents non-key words in the mail to be classified from impairing the accuracy of the classification; moreover, owing to the calculation of the class probability rate of the mail to be classified, the classification retains high accuracy when the mail to be classified is assigned to a mail class according to the ranked probabilities.

Further, calculating the feature word appearance ratio of the mail to be classified under each mail class simplifies the calculation in the mail classification process and helps ensure the accuracy of the classification; and classifying the mail according to its mail title and its mail content respectively further ensures the accuracy of the classification.

Brief description of the drawings

Fig. 1 is a flow chart of the method for determining mail classes and their mail data sample sets and feature lexicons according to an embodiment of the present invention;

Figs. 2a and 2b are flow charts of the mail classification method according to an embodiment of the present invention;

Fig. 3 is a block diagram of the internal structure of the mail classification device according to an embodiment of the present invention.

Detailed description

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that the many details set out in the specification are intended only to give the reader a thorough understanding of one or more aspects of the present invention, and these aspects can be practiced even without these specific details.

Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as, but not limited to, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable program, a thread of execution, a program and/or a computer. For example, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within a process and/or thread of execution, and a module may be located on one computer and/or distributed between two or more computers.

The inventors of the present invention found that the prior-art method misjudges mail for the following reason: when the mail content of a mail includes many features of a mail class that are not representative of that class, the calculated probability that the mail belongs to that class may be the largest, and assigning the mail to that class may then be inaccurate. For example, in dialogue mail between two friends who ask about each other's work, the mail content may include words such as "benefits", "pay" and "position"; these words may be features of recruitment mail, so the prior-art method may wrongly assign the mail to the recruitment class.

In view of this, a classification rule can be set in advance for each mail class, in which some more representative words are set as the keywords of the mail class. For example, one or more of the words "job", "resume" and "recruitment" are set as the keywords of recruitment mail. Then, after the probabilities that the mail to be classified belongs to each mail class are obtained and the mail class corresponding to the maximum probability is determined, it is first judged whether the feature words of the mail to be classified include a keyword of that mail class. If not, the mail to be classified does not satisfy the classification rule of that mail class, and whether to assign the mail to the mail class corresponding to the second-ranked probability can be decided according to the relative difference between the two top-ranked probabilities (referred to herein as the class probability rate) and the keywords of the mail class corresponding to the second-ranked probability. Thus, based on the keywords of the mail classes and the class probability rate, the mail to be classified can be classified more accurately.

The technical solution of the present invention is described in detail below with reference to the accompanying drawings. In the embodiment of the present invention, before mail is classified, several mail classes (such as business news, social, bank card, recruitment information, order information, registration information and news), together with the mail data sample set and feature lexicon of each mail class, can be determined in advance, so that the mail to be classified is classified on the basis of the predetermined mail classes. Specifically, the flow of the method for predetermining the mail classes and the mail data sample set and feature lexicon of each mail class is shown in Fig. 1 and includes the following steps:

S101: For each sample mail in the training mail set, obtain the word set of the sample mail; determine the word set of the training mail set from the word sets of the sample mails; and then determine the word feature vector of each sample mail.

Specifically, sample mails of non-dialogue mail, either within a set time period or of a set quantity, can be extracted from the mailboxes of the mail server, and these sample mails form the training mail set as its elements. For each sample mail in the training mail set, the mail data (including the mail title and mail content) of the sample mail is segmented into words, and stop words and rare words are removed from the resulting words to obtain the word set of the sample mail. The word sets of the sample mails in the training mail set are merged into a single word set, that is, words that are redundant because they repeat across the word sets of the sample mails are removed, to obtain the word set of the training mail set.

For each sample mail in the training mail set, the total number of words in the word set of the training mail set is taken as the dimension of the word feature vector of the sample mail, and each word in the word set of the training mail set corresponds to one vector element of the word feature vector of the sample mail. The value of each vector element in the word feature vector of the sample mail is determined as follows: if the word in the word set of the training mail set that corresponds to the vector element is included in the word set of the sample mail, the vector element is set to 1; otherwise it is set to 0. For example, the word feature vector of a sample mail in the training mail set is expressed as D = [d₁, …, d_j, …, d_L], where d_j is 1 or 0: 1 indicates that the j-th word in the word set of the training mail set is included in the word set of the current sample mail, and 0 indicates that it is not; j is a natural number from 1 to L, and L is the total number of words in the word set of the training mail set.
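As a minimal sketch of step S101, the word set of the training mail set and the 0/1 word feature vectors could be built as follows. The function names and the alphabetical ordering of the vocabulary are assumptions for the example; word segmentation and stop-word removal are assumed to have been done already.

```python
def build_vocabulary(mail_word_sets):
    """Merge the word sets of all sample mails into one ordered, duplicate-free list."""
    merged = set()
    for words in mail_word_sets:
        merged |= set(words)
    # Sorting fixes which word corresponds to which vector element (an assumption).
    return sorted(merged)


def to_feature_vector(mail_words, vocabulary):
    """Element j is 1 if the j-th vocabulary word occurs in the mail, otherwise 0."""
    present = set(mail_words)
    return [1 if word in present else 0 for word in vocabulary]


training_word_sets = [{"job", "resume"}, {"sale", "job"}]
vocabulary = build_vocabulary(training_word_sets)
vectors = [to_feature_vector(words, vocabulary) for words in training_word_sets]
```

In this toy example the vocabulary has L = 3 words, so every sample mail is represented by a three-element vector.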

S102: According to the similarity between the word feature vectors of the sample mails in the training mail set, cluster the sample mails in the training mail set using a clustering algorithm to obtain several clusters.

Specifically, the cosine similarity method can generally be used to calculate the similarity between the word feature vectors of any two sample mails, that is, the similarity between the two sample mails. For example, if the word feature vectors of sample mail x and sample mail y are X = [x₁, …, x_j, …, x_L] and Y = [y₁, …, y_j, …, y_L] respectively, the similarity Sim(X, Y) between the feature vectors of sample mail x and sample mail y can be calculated according to formula 2 below, where each sum runs over j from 1 to L:

Sim(X, Y) = Σ_j x_j·y_j / (√(Σ_j x_j²) · √(Σ_j y_j²)) (formula 2)

Thus, in this step, a similarity matrix can be built from the similarities between the word feature vectors of the sample mails in the training mail set, and the sample mails in the training mail set can be clustered using a clustering algorithm (such as a hierarchical clustering algorithm) to obtain several clusters that satisfy a preset clustering termination condition. For example, the termination condition can be that the maximum similarity between clusters reaches a set similarity threshold, or that the number of sample mails in a cluster reaches a set value. Building a similarity matrix and clustering with a clustering algorithm are well known to those skilled in the art and are not described further here.
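The cosine similarity and the similarity matrix used in step S102 can be sketched in a few lines. The function names are illustrative, and returning 0.0 for an all-zero vector is an assumption of this sketch; the clustering step itself is left out because the text treats it as well known.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty mail is treated as similar to nothing
    return dot / (norm_x * norm_y)


def similarity_matrix(vectors):
    """Pairwise similarities between all sample mails, as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


sample_vectors = [[1, 1, 0], [1, 0, 1], [0, 0, 1]]
matrix = similarity_matrix(sample_vectors)
```

The resulting matrix is symmetric, and a hierarchical clustering routine could take it as input to merge the most similar sample mails first.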

S103: For each cluster obtained by clustering, assign the sample mails included in the cluster to the same mail class, and form the mail data sample set of each mail class from the sample mails of that class.

S104: For the mail data sample set of each mail class, extract the feature words of the sample mails in the mail data sample set of the mail class, and thereby obtain the feature lexicon of the mail class.

In this step, for the mail data sample set of each mail class obtained in step S103, the feature words of the sample mails in the mail data sample set of the mail class are extracted as follows: each sample mail in the mail data sample set of the mail class is segmented into words, and the number of times each resulting word appears in the mail data sample set of the mail class is counted as the word frequency of that word; after rare words and stop words are removed from the resulting words, the words whose word frequency is greater than a set lower threshold and less than a set upper threshold are determined as candidate words of the mail class; the candidate words whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table are determined as the feature words of the mail class, and the feature words of the mail class constitute the feature lexicon of the mail class. Here, segmenting a sample mail means segmenting both its mail title and its mail content. The part-of-speech information table records the parts of speech selected to improve the accuracy of mail classification, such as adverbial words, idioms, nouns, adjectives, verbs, time words, place words, morphemes and measure words, so that person names, auxiliary words and the like are filtered out.
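A rough sketch of the lexicon construction in step S104 is given below. It assumes the sample mails are already segmented into word lists, and it stands in for a real part-of-speech tagger with a hypothetical lookup dictionary `pos_of`; all names and the example data are illustrative only.

```python
from collections import Counter


def build_feature_lexicon(sample_mails, stop_words, rare_words,
                          lower, upper, pos_of, allowed_pos):
    """Return the feature lexicon (a set of words) for one mail class.

    sample_mails: list of word lists (title and content, already segmented).
    pos_of: word -> part-of-speech tag (hypothetical stand-in for a tagger).
    allowed_pos: tags recorded in the part-of-speech information table.
    """
    # Word frequency = number of occurrences in the class's sample set.
    frequency = Counter(word for words in sample_mails for word in words)
    lexicon = set()
    for word, count in frequency.items():
        if word in stop_words or word in rare_words:
            continue
        if not (lower < count < upper):  # strictly between the two thresholds
            continue
        if pos_of.get(word) in allowed_pos:
            lexicon.add(word)
    return lexicon


mails = [
    ["job", "the", "job", "apply"],
    ["job", "the", "resume", "resume"],
    ["the", "quickly"],
]
tags = {"job": "noun", "resume": "noun", "apply": "verb", "quickly": "adverb"}
lexicon = build_feature_lexicon(mails, {"the"}, set(), 1, 10, tags, {"noun", "verb"})
```

In this example "apply" is dropped because its frequency does not exceed the lower threshold, and "the" is dropped as a stop word.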

Based on the predetermined mail classes, the flow of the mail classification method provided by the embodiment of the present invention is shown in Figs. 2a and 2b and includes the following steps:

S201: For the mail to be classified, extract the feature words from the mail data of the mail to be classified; set i = 1.

Specifically, the mail to be classified can be segmented into words using an existing word segmentation algorithm, and the feature words of the mail to be classified are obtained after rare words and stop words are removed from the resulting words.

To simplify the calculation while increasing classification accuracy, after step S201, in steps S202 to S205, for each predetermined mail class, the feature word appearance ratio of the mail to be classified under the mail class is calculated, and the probability that the mail to be classified belongs to the mail class is calculated only after it is determined that this ratio is greater than the set ratio threshold.

S202: For the predetermined i-th mail class, calculate the feature word appearance ratio of the mail to be classified under the i-th mail class.

Specifically, for the predetermined i-th mail class, the number of feature words of the mail to be classified that are contained in the feature lexicon of the i-th mail class is determined, the ratio of that number to the total number of feature words of the mail to be classified is calculated, and the calculated ratio is taken as the feature word appearance ratio of the mail to be classified under the i-th mail class. The feature word appearance ratio of the mail to be classified under a mail class reflects how many of its feature words appear in the feature lexicon of that class, and thus reflects the likelihood that the mail belongs to that class: if the ratio is small, the probability that the mail belongs to the class is small; otherwise, it is large. Here, i is a natural number from 1 to m, and m is the number of predetermined mail classes.

S203: Compare the feature word appearance ratio of the mail to be classified under the i-th mail class with the set ratio threshold, and judge whether the ratio is greater than the set ratio threshold; if so, perform step S204; otherwise, jump to step S205.

That is, it is judged whether the feature word appearance ratio of the mail to be classified under the i-th mail class is greater than the set ratio threshold; if so, the probability that the mail to be classified belongs to the i-th mail class is calculated; otherwise, it is directly judged whether i is equal to m, and the probability that the mail to be classified belongs to the i-th mail class is not calculated, which simplifies the calculation while ensuring classification accuracy.
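Steps S202 and S203 amount to a cheap pre-filter over the mail classes. The sketch below shows one way to express it; the function names, the dictionary of lexicons and the example threshold are assumptions for illustration.

```python
def appearance_ratio(mail_features, lexicon):
    """Share of the mail's feature words that are found in the class's feature lexicon."""
    if not mail_features:
        return 0.0
    hits = sum(1 for word in mail_features if word in lexicon)
    return hits / len(mail_features)


def candidate_classes(mail_features, lexicons, ratio_threshold):
    """Classes whose appearance ratio exceeds the threshold; only these get a probability."""
    return [
        name for name, lexicon in lexicons.items()
        if appearance_ratio(mail_features, lexicon) > ratio_threshold
    ]


# Hypothetical lexicons and mail, for illustration only.
lexicons = {
    "recruitment": {"job", "resume", "salary"},
    "business": {"sale", "discount"},
}
features = ["job", "resume", "hello", "weekend"]
candidates = candidate_classes(features, lexicons, ratio_threshold=0.3)
```

Here half of the mail's feature words belong to the recruitment lexicon and none to the business lexicon, so only the recruitment class proceeds to the probability calculation of step S204.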

S204: According to the feature words of the mail to be classified, calculate the probability that the mail to be classified belongs to the i-th mail class, and take the calculated probability as the probability corresponding to the i-th mail class.

If the comparison shows that the feature word appearance ratio of the mail to be classified under the i-th mail class is greater than the set ratio threshold, the probability that the mail to be classified belongs to that mail class is calculated.

Specifically, existing NB Algorithm can be based on, it is assumed that separate between the Feature Words of mail to be sorted, And remember that i-th mail classes is C_i, n Feature Words of mail to be sorted are respectively F₁,F₂,...,F_n, then based on naive Bayesian Algorithm, mail to be sorted belongs to i-th mail classes C_iThe probability P (C that are represented by equation below 3_i|F₁,F₂,..., F_n)：

Since the feature words of the mail to be classified are mutually independent:

P(F₁,F₂,...,F_n|C_i)=P(F₁|C_i)P(F₂|C_i)...P(F_n|C_i);

And:

P(F₁,F₂,...,F_n)=P(F₁)P(F₂)...P(F_n);

P(F₁,F₂,...,F_n) is the same for every mail class, therefore:

P(C_i|F₁,F₂,...,F_n)∝P(C_i)P(F₁|C_i)P(F₂|C_i)...P(F_n|C_i);

Calculating P(C_i|F₁,F₂,...,F_n) thus reduces to calculating P(C_i) and P(F_k|C_i). The value of the following formula 1 can therefore be calculated and taken as the probability that the mail to be classified belongs to the i-th mail class:

P(C_i)P(F₁|C_i)P(F₂|C_i)...P(F_n|C_i) (formula 1)

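In formula 1, k is a natural number from 1 to n; P(F_k|C_i) is determined from the number of times feature word F_k occurs in the mail data sample set of mail class C_i and the sum of the numbers of times the feature words in the feature lexicon of C_i occur in that sample set; P(C_i) is determined from the number of sample mails in the mail data sample set of C_i and S, the total number of sample mails in the mail data sample sets of all mail classes.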
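As an illustration only, formula 1 can be sketched in Python as follows. The sketch assumes that P(C_i) is the share of all sample mails that belong to C_i and that P(F_k|C_i) is the relative frequency of F_k among the feature word occurrences counted in the sample set of C_i, with no smoothing; these estimators, the function name and the parameter names are assumptions of the example rather than definitions taken from the text.

```python
def formula1_probability(mail_words, word_counts, n_class_mails, n_all_mails):
    """Value of formula 1: P(C_i) * P(F_1|C_i) * ... * P(F_n|C_i).

    word_counts: hypothetical mapping {feature word: occurrences in the
                 mail data sample set of class C_i}.
    n_class_mails: number of sample mails of class C_i.
    n_all_mails: total number of sample mails over all classes (S).
    """
    total_occurrences = sum(word_counts.values())
    if total_occurrences == 0 or n_all_mails == 0:
        return 0.0
    probability = n_class_mails / n_all_mails            # assumed P(C_i)
    for word in mail_words:
        # Assumed P(F_k|C_i); an unseen word contributes a factor of zero.
        probability *= word_counts.get(word, 0) / total_occurrences
    return probability
```

Because the product of many small factors can underflow for long mails, a practical variant might sum logarithms instead; the plain product is kept here so that the result can be used directly in the difference-rate comparison of the later steps.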

Moreover, using the feature word occurrence ratio as a judgment factor before calculating, with the above naive Bayes algorithm, the probability that the mail to be classified belongs to the i-th mail class C_i avoids the situation in which one feature word of the mail occurs so often in the mail data sample set of C_i that it alone determines the mail class. For example, if feature word F₁ occurs very many times in the mail data sample set of C_i while the other feature words hardly occur there at all, P(F₁|C_i) may be large enough to make P(C_i)P(F₁|C_i)P(F₂|C_i)...P(F_n|C_i) large, so that the classification of the mail is not accurate enough. Using the feature word occurrence ratio as a judgment factor avoids this situation well.

Alternatively, after the word feature vector of the mail to be classified is obtained, each element of the vector may be normalized, the similarity between the feature vector of the mail and the feature vector of each sample mail in the i-th mail class C_i may be calculated, and the mean of these similarities may be taken as the probability that the mail belongs to the i-th mail class.
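The similarity-based alternative can be sketched as below. The text does not name a normalization or a similarity measure, so the sketch assumes L2 normalization and cosine similarity; both choices, and all names, are illustrative assumptions.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (assumed L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def mean_similarity(mail_vector, sample_vectors):
    """Mean cosine similarity between the mail and the class's sample mails.

    The mean is used as the probability that the mail belongs to the class.
    """
    if not sample_vectors:
        return 0.0
    query = normalize(mail_vector)
    similarities = [
        sum(a * b for a, b in zip(query, normalize(sample)))
        for sample in sample_vectors
    ]
    return sum(similarities) / len(similarities)
```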

S205: Judge whether i equals m; if so, perform step S206; otherwise, set i=i+1 and jump to step S202.

Specifically, m is the number of predetermined mail classes. If i=m, the feature word occurrence ratio of the mail to be classified has been calculated under every mail class. If i≠m, i is set to i+1 and the procedure jumps to step S202 to calculate the feature word occurrence ratio of the mail under the next, (i+1)-th, mail class.

S206: Sort the calculated probabilities corresponding to the mail classes, and judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability; if so, perform step S207; otherwise, perform step S210.

Specifically, a keyword table is stored in advance for each mail class, and the keywords in the table are generally predetermined, for example as follows: for each mail class and each feature word in the feature lexicon of that class, the number of sample mails in the class that contain the feature word is counted in advance, and the feature words are sorted in descending order of that number; a set number of top-ranked feature words are taken as the keywords of the mail class. Alternatively, the keywords of each mail class may be set by a person skilled in the art based on experience. For example, "work", "resume" and "recruitment" may be set as keywords of recruitment mail.
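The count-based keyword selection just described can be sketched as follows. The function name, the representation of sample mails as lists of words, and the alphabetical tie-break are assumptions made so that the example is deterministic.

```python
from collections import Counter


def select_keywords(lexicon, sample_mails, top_n):
    """Pick the top_n lexicon words contained in the most sample mails.

    lexicon: set of feature words of one mail class.
    sample_mails: list of word lists, one per sample mail of that class.
    """
    mails_containing = Counter()
    for words in sample_mails:
        # A mail counts once per feature word, however often the word repeats.
        for word in set(words) & set(lexicon):
            mails_containing[word] += 1
    # Descending by mail count; ties broken alphabetically (an assumption).
    ranked = sorted(lexicon, key=lambda w: (-mails_containing[w], w))
    return ranked[:top_n]
```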

In this step, the calculated probabilities corresponding to the mail classes are sorted in descending order, and it is judged whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability, that is, whether the mail satisfies the classification rule of that mail class. If the feature words include one or more keywords of the mail class corresponding to the maximum probability, the mail satisfies the classification rule of that class and can be classified into it directly. If the feature words include no keyword of that class, classifying the mail into the class corresponding to the maximum probability would not be accurate enough, and the mail is processed according to steps S210 to S216 below.

S207: Classify the mail to be classified into the mail class corresponding to the maximum probability.

If it is judged that the feature words of the mail to be classified include one or more keywords of the mail class corresponding to the maximum probability, the mail is classified into that mail class.

S210: Set h=2, where h is a natural number greater than or equal to 2 and less than or equal to m.

S211: Calculate the difference between the maximum probability and the probability ranked h-th, then calculate the ratio of that difference to the maximum probability as the (h-1)-th classification probability difference rate.

For example, when h=2, the difference between the maximum probability and the second-ranked probability is calculated, and the ratio of that difference to the maximum probability is taken as the first classification probability difference rate. That is, if the maximum probability is P₁ and the second-ranked probability is P₂, the first classification probability difference rate of the mail to be classified is P_d1=(P₁-P₂)/P₁.

Likewise, if the third-ranked probability is P₃, the second classification probability difference rate is P_d2=(P₁-P₃)/P₁.

S212: Judge whether the (h-1)-th classification probability difference rate is less than the set difference-rate threshold; if so, perform step S213; otherwise, perform step S216.

The set difference-rate threshold can be chosen by a person skilled in the art according to the actual mail classification situation.

S213: Judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the probability ranked h-th; if so, perform step S214; otherwise, perform step S215.

For example, when h=2, if the first classification probability difference rate is less than the set difference-rate threshold, it is judged whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability.

S214: Classify the mail to be classified into the mail class corresponding to the probability ranked h-th.

If it is judged in step S212 that the (h-1)-th classification probability difference rate of the mail to be classified is less than the set difference-rate threshold, and in step S213 that the feature words of the mail include at least one keyword of the mail class corresponding to the probability ranked h-th, the mail is classified into that mail class.

For example, when h=2, if the first classification probability difference rate is less than the set difference-rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, the mail is classified into the mail class corresponding to the second-ranked probability.

Likewise, when h=3, if the second classification probability difference rate is less than the set difference-rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, the mail is classified into the mail class corresponding to the third-ranked probability.

S215: Judge whether h equals m; if so, perform step S216; otherwise, set h=h+1 and jump to step S211.

Specifically, if the (h-1)-th classification probability difference rate is less than the set difference-rate threshold but the feature words of the mail to be classified do not include a keyword of the mail class corresponding to the probability ranked h-th, and h is not equal to m, then the difference between the maximum probability and the probability ranked (h+1)-th is calculated, the ratio of that difference to the maximum probability is taken as the h-th classification probability difference rate, and the mail is classified according to the h-th classification probability difference rate.

S216: Determine the mail to be classified to be a dialogue mail.

If it is judged in step S212 that the (h-1)-th classification probability difference rate of the mail to be classified is not less than (that is, greater than or equal to) the set difference-rate threshold, the mail is determined in this step to be an ordinary dialogue mail. For example, when h=2, if the first classification probability difference rate is not less than the set difference-rate threshold, the mail is determined to be a dialogue mail.

Alternatively, after it is judged in step S215 that h=m, the mail to be classified is determined in this step to be an ordinary dialogue mail.
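Steps S206 to S216 can be read as a single decision routine, sketched below. The data layout (dictionaries keyed by class name), the label "dialogue" and the guard against a non-positive maximum probability are assumptions of this example and do not come from the text.

```python
DIALOGUE = "dialogue"  # hypothetical label for an ordinary dialogue mail


def decide_class(probabilities, keywords, mail_words, diff_threshold):
    """Walk the ranked classes as in steps S206 to S216.

    probabilities: {class name: calculated probability}.
    keywords: {class name: set of keywords of that class}.
    """
    words = set(mail_words)
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return DIALOGUE
    top_class, p_max = ranked[0]
    if words & keywords.get(top_class, set()):      # S206 -> S207
        return top_class
    if p_max <= 0:                                  # guard added for the sketch
        return DIALOGUE
    for candidate, p_h in ranked[1:]:               # h = 2, 3, ..., m
        rate = (p_max - p_h) / p_max                # S211
        if rate >= diff_threshold:                  # S212 -> S216
            return DIALOGUE
        if words & keywords.get(candidate, set()):  # S213 -> S214
            return candidate
    return DIALOGUE                                 # S215 with h = m -> S216
```

Because the probabilities are sorted in descending order, the difference rate can only grow with h, so the loop stops at the first rank whose rate reaches the threshold.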

More preferably, the present invention may also perform word segmentation separately on the mail title and the mail content of the mail to be classified, extracting title feature words from the mail title and content feature words from the mail content; in other words, the feature words of the mail to be classified specifically include title feature words and content feature words. After the title feature words and content feature words are extracted, the probability that the mail title belongs to a mail class can be calculated according to the title feature words and taken as the title probability corresponding to that mail class, and the probability that the mail content belongs to the mail class can be calculated according to the content feature words and taken as the content probability corresponding to that mail class.

Afterwards, the calculated title probabilities corresponding to the mail classes are sorted; if the title feature words of the mail to be classified are judged to include at least one keyword of the mail class corresponding to the maximum title probability, that mail class is taken as the pending mail class corresponding to the mail title. Likewise, the calculated content probabilities corresponding to the mail classes are sorted; if the content feature words of the mail are judged to include a keyword of the mail class corresponding to the maximum content probability, that mail class is taken as the pending mail class corresponding to the mail content. If the pending mail class corresponding to the mail title and the pending mail class corresponding to the mail content are the same, the mail to be classified is classified into that pending mail class; otherwise, the mail is classified as a dialogue mail.

In this way, if the error probability of a single mail classification is P_e, the error probability of the method that judges the class separately from the mail title and the mail content is P_e², so that the classification error rate is reduced, that is, the accuracy of mail classification is improved.
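A minimal sketch of this title and content agreement check follows. It assumes the title and content probabilities have already been calculated per class, and it treats a missing pending class (keyword check failed) as a disagreement; that handling, like all names here, is an assumption of the example.

```python
def pending_class(probabilities, keywords, words):
    """Class with the maximum probability, if the words contain one of its keywords."""
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    return best if set(words) & keywords.get(best, set()) else None


def classify_by_title_and_content(title_probs, content_probs, keywords,
                                  title_words, content_words,
                                  fallback="dialogue"):
    """Accept a class only when title and content yield the same pending class."""
    from_title = pending_class(title_probs, keywords, title_words)
    from_content = pending_class(content_probs, keywords, content_words)
    if from_title is not None and from_title == from_content:
        return from_title
    return fallback
```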

More preferably, since some senders usually send mails of one or several particular mail classes, the present invention may also record the sender of each sample mail of each mail class. When a mail to be classified is received, the mail classes of the sample mails previously sent by its sender can be determined from the sender, the probabilities that the mail belongs to these mail classes can be calculated directly, and the probability that is both greater than a set probability threshold and the largest can be identified; the mail is then classified into the mail class corresponding to that probability. Classification is thus performed on the basis of the sender, which simplifies the calculation.
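The sender-based shortcut might look like the sketch below. The text does not say what happens when the sender is unknown or no probability exceeds the threshold, so the sketch returns `None` in those cases on the assumption that the caller then falls back to the full procedure; `score_fn` is a hypothetical callable that returns the probability for one class.

```python
def classify_by_sender(sender, sender_history, score_fn, prob_threshold):
    """Score only the classes this sender's earlier sample mails belonged to.

    sender_history: hypothetical mapping {sender address: set of class names}.
    Returns the class with the largest probability above the threshold,
    or None when no class qualifies.
    """
    candidates = sender_history.get(sender, set())
    scored = {name: score_fn(name) for name in candidates}
    qualifying = {name: p for name, p in scored.items() if p > prob_threshold}
    if not qualifying:
        return None
    return max(qualifying, key=qualifying.get)
```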

Based on the above mail classification method, the internal structure of the mail classification device of the embodiment of the present invention is shown in Fig. 3 and specifically includes: probability calculation module 301, category division module 302 and sorting module 304.

Probability calculation module 301 is used to calculate, for each predetermined mail class and according to the feature words of the mail to be classified, the probability that the mail belongs to the mail class, and to take the calculated probability as the probability corresponding to that mail class.

Sorting module 304 is used to sort the calculated probabilities corresponding to the mail classes and obtain a ranking result.

Category division module 302 is used to judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability in the ranking result; if so, to classify the mail into the mail class corresponding to the maximum probability; otherwise, to calculate the difference between the maximum probability and the second-ranked probability in the ranking result and then the ratio of that difference to the maximum probability; and, if the calculated ratio is judged to be less than the set difference-rate threshold and the feature words of the mail include at least one keyword of the mail class corresponding to the second-ranked probability, to classify the mail into the mail class corresponding to the second-ranked probability.

Further, category division module 302 is also used to determine the mail to be classified to be a dialogue mail if the calculated ratio is judged to be not less than the set difference-rate threshold. If the calculated ratio is judged to be less than the set difference-rate threshold and the feature words of the mail do not include a keyword of the mail class corresponding to the second-ranked probability, the module takes the calculated ratio as the first classification probability difference rate, further calculates the difference between the maximum probability and the third-ranked probability in the ranking result, and takes the ratio of that difference to the maximum probability as the second classification probability difference rate; when it is determined that the second classification probability difference rate is less than the set difference-rate threshold and the feature words of the mail include at least one keyword of the mail class corresponding to the third-ranked probability, the module classifies the mail into the mail class corresponding to the third-ranked probability.

Further, the above mail classification device may also include: feature word occurrence ratio pre-judgment module 303.

Feature word occurrence ratio pre-judgment module 303 is used to determine, for each predetermined mail class, the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail class, and to calculate the ratio of that number to the total number of feature words of the mail as the feature word occurrence ratio of the mail under the mail class; and, when it confirms that the feature word occurrence ratio of the mail under the mail class is greater than the set ratio threshold, to trigger probability calculation module 301. Correspondingly, probability calculation module 301 calculates, according to the feature words of the mail to be classified, the probability that the mail belongs to the mail class, and takes the calculated probability as the probability corresponding to that mail class.

The functions realized by the modules of the mail classification device are as described for the steps of the mail classification method shown in Figs. 2a and 2b above.

The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A mail classification method, characterized by comprising:

For each predetermined mail class, calculating, according to the feature words of a mail to be classified, the probability that the mail to be classified belongs to the mail class, and taking the calculated probability as the probability corresponding to the mail class;

Sorting the calculated probabilities corresponding to the mail classes, and judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability; if so, classifying the mail to be classified into the mail class corresponding to the maximum probability; otherwise:

Calculating the difference between the maximum probability and the second-ranked probability, and calculating the ratio of the difference to the maximum probability; if the calculated ratio is judged to be less than a set difference-rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, classifying the mail to be classified into the mail class corresponding to the second-ranked probability.

2. The method according to claim 1, characterized in that, before calculating the probability that the mail to be classified belongs to the mail class, the method further comprises:

Determining the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail class, and calculating the ratio of the determined number to the total number of feature words of the mail to be classified as the feature word occurrence ratio of the mail to be classified under the mail class; and confirming that the feature word occurrence ratio of the mail to be classified under the mail class is greater than a set ratio threshold.

3. The method according to claim 2, characterized in that the keywords of the mail class are predetermined as follows:

For each mail class and each feature word in the feature lexicon of the mail class, counting in advance the number of sample mails in the mail class that contain the feature word, and sorting in descending order; taking a set number of top-ranked feature words as the keywords of the mail class.

4. The method according to claim 3, characterized in that calculating, for each predetermined mail class and according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class specifically comprises:

Denoting the i-th mail class by C_i and the n feature words of the mail to be classified by F₁,F₂,...,F_n, calculating the value of the following formula 1 as the probability that the mail to be classified belongs to the i-th mail class:

P(C_i)P(F₁|C_i)P(F₂|C_i)...P(F_n|C_i) (formula 1)

In formula 1, k is a natural number from 1 to n; P(F_k|C_i) is determined from the number of times feature word F_k occurs in the mail data sample set of mail class C_i and the sum of the numbers of times the feature words in the feature lexicon of C_i occur in that sample set; P(C_i) is determined from the number of sample mails in the mail data sample set of C_i and S, where S is the total number of sample mails in the mail data sample sets of all mail classes.

5. The method according to claim 4, characterized in that the feature lexicon of the mail class is obtained as follows:

For each mail class, performing word segmentation on the sample mails in the mail data sample set of the mail class, and counting the number of times each word obtained by segmentation occurs in the mail data sample set of the mail class as the word frequency of the word; after removing rare words and stop words from the words obtained by segmentation, determining the words whose word frequency is greater than a set lower threshold and less than a set upper threshold as candidate words of the mail class; determining the candidate words whose part-of-speech information matches the part-of-speech information recorded in a part-of-speech information table as the feature words of the mail class, the feature words of the mail class forming the feature lexicon of the mail class;

Wherein the mail data sample set of each mail class is obtained by partitioning the sample mails with a clustering algorithm based on the similarity between the feature vectors of the sample mails.

6. The method according to claim 4 or 5, characterized in that the feature words of the mail to be classified specifically include: title feature words extracted from the mail title of the mail to be classified, and content feature words extracted from the mail content of the mail to be classified; and

Calculating, according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class specifically comprises:

Calculating, according to the title feature words of the mail to be classified, the probability that the mail title of the mail to be classified belongs to the mail class, and taking that probability as the title probability corresponding to the mail class; and

Calculating, according to the content feature words of the mail to be classified, the probability that the mail content of the mail to be classified belongs to the mail class, and taking that probability as the content probability corresponding to the mail class; and

Sorting the calculated probabilities corresponding to the mail classes, judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability, and, if so, classifying the mail to be classified into the mail class corresponding to the maximum probability specifically comprises:

Sorting the calculated title probabilities corresponding to the mail classes; if the title feature words of the mail to be classified are judged to include at least one keyword of the mail class corresponding to the maximum title probability, taking the mail class corresponding to the maximum title probability as the pending mail class corresponding to the mail title; and

Sorting the calculated content probabilities corresponding to the mail classes; if the content feature words of the mail to be classified are judged to include a keyword of the mail class corresponding to the maximum content probability, taking the mail class corresponding to the maximum content probability as the pending mail class corresponding to the mail content;

If the pending mail class corresponding to the mail title is the same as the pending mail class corresponding to the mail content, classifying the mail to be classified into the pending mail class.

7. The method according to any one of claims 1-5, characterized in that, after calculating the difference between the maximum probability and the second-ranked probability and calculating the ratio of the difference to the maximum probability, the method further comprises:

If the ratio of the difference to the maximum probability is judged to be not less than the set difference-rate threshold, determining the mail to be classified to be a dialogue mail;

If the ratio of the difference to the maximum probability is judged to be less than the set difference-rate threshold and the feature words of the mail to be classified do not include a keyword of the mail class corresponding to the second-ranked probability, then:

Taking the ratio of the difference to the maximum probability as a first classification probability difference rate, further calculating the difference between the maximum probability and the third-ranked probability, and taking the ratio of that difference to the maximum probability as a second classification probability difference rate; if it is determined that the second classification probability difference rate is less than the set difference-rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, classifying the mail to be classified into the mail class corresponding to the third-ranked probability.

8. A mail classification device, characterized by comprising:

A probability calculation module, used to calculate, for each predetermined mail class and according to the feature words of a mail to be classified, the probability that the mail to be classified belongs to the mail class, and to take the calculated probability as the probability corresponding to the mail class;

A category division module, used to judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability in the ranking result; if so, to classify the mail to be classified into the mail class corresponding to the maximum probability; otherwise: to calculate the difference between the maximum probability and the second-ranked probability in the ranking result, and to calculate the ratio of the difference to the maximum probability; and, if it is determined that the calculated ratio is less than a set difference-rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, to classify the mail to be classified into the mail class corresponding to the second-ranked probability.

9. The device according to claim 8, characterized by further comprising:

A feature word occurrence ratio pre-judgment module, used to determine, for each predetermined mail class, the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail class, and to calculate the ratio of the determined number to the total number of feature words of the mail to be classified as the feature word occurrence ratio of the mail to be classified under the mail class; and, when it confirms that the feature word occurrence ratio of the mail to be classified under the mail class is greater than a set ratio threshold, to trigger the probability calculation module.

10. The device according to claim 8 or 9, characterized in that:

The category division module is further used to determine the mail to be classified to be a dialogue mail if the ratio of the difference to the maximum probability is judged to be not less than the set difference-rate threshold; and, if the ratio of the difference to the maximum probability is judged to be less than the set difference-rate threshold and the feature words of the mail to be classified do not include a keyword of the mail class corresponding to the second-ranked probability: to take the ratio of the difference to the maximum probability as a first classification probability difference rate, further calculate the difference between the maximum probability and the third-ranked probability in the ranking result, and take the ratio of that difference to the maximum probability as a second classification probability difference rate; and, when it is determined that the second classification probability difference rate is less than the set difference-rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, to classify the mail to be classified into the mail class corresponding to the third-ranked probability.