CN103984703A

CN103984703A - Mail classification method and device

Info

Publication number: CN103984703A
Application number: CN201410163082.1A
Authority: CN
Inventors: 陈玉焓
Original assignee: Sina Technology China Co Ltd
Current assignee: Sina Technology China Co Ltd
Priority date: 2014-04-22
Filing date: 2014-04-22
Publication date: 2014-08-13
Anticipated expiration: 2034-04-22
Also published as: CN103984703B

Abstract

The invention discloses a mail classification method and device. The method comprises the following steps: for each mail class, calculating the probability that a mail to be classified belongs to the mail class, and taking the calculated probability as the probability corresponding to the mail class; sorting the calculated probabilities corresponding to the mail classes, and judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability; if so, classifying the mail to be classified into the mail class corresponding to the maximum probability; otherwise, calculating the difference between the maximum probability and the second-ranked probability, and the ratio of this difference to the maximum probability; if the ratio is less than a set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, classifying the mail to be classified into the mail class corresponding to the second-ranked probability. The keywords set for each mail class thus make the mail classification more accurate.

Description

Mail classification method and device

Technical field

The present invention relates to the field of the Internet, and in particular to a mail classification method and device.

Background art

E-mail transmits information over the network in a store-and-forward manner and features fast propagation, a wide range of communication partners, and low cost. In the current Internet information age, it is increasingly common for people to communicate by e-mail.

Typically, the mailbox of an e-mail user contains many types of mail, for example merchant news, social, order, recruitment, training institution, and bank and wealth-management mail, as well as ordinary dialogue mail (such as greetings exchanged between friends). If a user's inbox contains too much promotional mail such as merchant news, user complaints may increase; and if mail is delivered to the user's inbox indiscriminately, mail of all types is mixed together, which makes it hard for the user to find and read the mail they need. Mail systems therefore tend to classify mail into several classes so that users have a better mailbox experience. For example, besides the ordinary inbox, Gmail has classes such as promotional mail and website update mail, and QQ Mail has subscription mail.

At present, one existing mail classification method is mainly based on a clustering algorithm: according to the feature words obtained by segmenting the mail data of training sample mails into words, the training sample mails are divided into several mail classes, and a mail data sample set is formed for each mail class. Then, according to the feature words of the mail data of a mail to be classified, the probability that the mail belongs to the mail data sample set of each mail class is calculated, the mail class corresponding to the maximum probability is taken as the class of the mail, and the mail is placed into the mail data sample set of that mail class. Here, the mail data is generally the mail content.

However, the inventor found that the accuracy of the prior-art mail classification method is low: some mails are misclassified, so that users cannot view the mail they need in time. For example, a user who is job hunting may care most about recruitment mail, but the prior-art method may place recruitment mail in the training institution class, so that the user does not receive recruitment information in time. As another example, if ordinary dialogue mail is classified as merchant news mail, the user may fail to see the misclassified dialogue mail in time, which causes great inconvenience. It is therefore necessary to provide a mail classification method that classifies mail more accurately.

Summary of the invention

In view of the above defects of the prior art, the present invention provides a mail classification method and device to improve the accuracy of mail classification.

According to one aspect of the present invention, a mail classification method is provided, comprising:

For each predetermined mail class, calculating, according to the feature words of a mail to be classified, the probability that the mail to be classified belongs to the mail class, and taking the calculated probability as the probability corresponding to the mail class;

Sorting the calculated probabilities corresponding to the mail classes, and judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability; if so, classifying the mail to be classified into the mail class corresponding to the maximum probability; otherwise:

Calculating the difference between the maximum probability and the second-ranked probability, and calculating the ratio of this difference to the maximum probability; if the calculated ratio is less than a set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, classifying the mail to be classified into the mail class corresponding to the second-ranked probability.

Preferably, before calculating the probability that the mail to be classified belongs to the mail class, the method further comprises:

Determining the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail class, and calculating the ratio of this number to the total number of feature words of the mail to be classified as the feature word occurrence ratio of the mail to be classified under the mail class; and confirming that the feature word occurrence ratio of the mail to be classified under the mail class is greater than a set ratio threshold.

Wherein, the keywords of the mail class are predetermined as follows:

For each mail class, and for each feature word in the feature lexicon of the mail class, counting in advance the number of sample mails in the mail class that contain the feature word, and sorting the feature words by this number in descending order; and taking a set number of top-ranked feature words as the keywords of the mail class.
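The keyword selection described above can be sketched as follows. This is a minimal illustration only: the function name `select_keywords`, the token-list representation of sample mails, and the alphabetical tie-break are assumptions that do not come from the specification.

```python
from collections import Counter

def select_keywords(sample_mails, lexicon, top_k):
    """Return the top_k lexicon words contained in the most sample mails.

    sample_mails: list of token lists, one per sample mail of one mail class.
    lexicon: set of feature words of that mail class.
    """
    doc_freq = Counter()
    for tokens in sample_mails:
        # Count each feature word at most once per sample mail.
        doc_freq.update(set(tokens) & lexicon)
    # Descending by number of containing mails; ties broken alphabetically
    # (the tie-break is an assumption made only for determinism).
    ranked = sorted(lexicon, key=lambda w: (-doc_freq[w], w))
    return ranked[:top_k]

mails = [["resume", "job", "salary"], ["job", "recruitment"], ["job", "resume"]]
lexicon = {"resume", "job", "salary", "recruitment"}
print(select_keywords(mails, lexicon, 2))  # ['job', 'resume']
```

Counting each word once per mail reflects the wording "the number of sample mails that contain the feature word", as opposed to the raw word frequency used when building the lexicon.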

Preferably, calculating, for each predetermined mail class and according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class specifically comprises:

Denoting the i-th mail class as C_{i} and the n feature words of the mail to be classified as F_{1}, F_{2}, \ldots, F_{n}, calculating the value of the following formula 1 as the probability that the mail to be classified belongs to the i-th mail class:

P(C_{i}) P(F_{1} | C_{i}) P(F_{2} | C_{i}) \cdots P(F_{n} | C_{i}) (formula 1)

In formula 1,

P (F_{k} | C_{i}) = \frac{f_{F_{k}} + 1}{f_{C_{i}} + 1}, P (C_{i}) = \frac{S_{C_{i}}}{S};

Wherein, k is a natural number from 1 to n; f_{F_{k}} is the number of times the feature word F_{k} occurs in the mail data sample set of mail class C_{i}; f_{C_{i}} is the sum of the numbers of times the feature words in the feature lexicon of mail class C_{i} occur in the mail data sample set of mail class C_{i}; S_{C_{i}} is the number of sample mails in the mail data sample set of mail class C_{i}; and S is the total number of sample mails in the mail data sample sets of all mail classes.
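As an illustration of formula 1 and the two estimates defined above, the following sketch computes the score for one mail class. The function name `class_score` and its argument layout are assumptions. The product is computed directly, so very long mails could underflow to zero in floating point.

```python
from collections import Counter
from math import prod

def class_score(mail_words, class_word_counts, lexicon, n_class_mails, n_total_mails):
    """Value of formula 1 for one mail class C_i.

    mail_words: feature words F_1..F_n of the mail to be classified.
    class_word_counts: Counter of word occurrences in the sample set of C_i.
    lexicon: feature lexicon of C_i.
    n_class_mails, n_total_mails: S_{C_i} and S.
    """
    # f_{C_i}: total occurrences of all lexicon words in the class sample set.
    f_c = sum(class_word_counts[w] for w in lexicon)
    prior = n_class_mails / n_total_mails  # P(C_i) = S_{C_i} / S
    # P(F_k | C_i) = (f_{F_k} + 1) / (f_{C_i} + 1), exactly as given in the text.
    likelihoods = [(class_word_counts[w] + 1) / (f_c + 1) for w in mail_words]
    return prior * prod(likelihoods)

counts = Counter({"job": 3, "resume": 2, "salary": 1})
score = class_score(["job", "salary"], counts, {"job", "resume", "salary"}, 3, 10)
print(score)  # 0.3 * (4/7) * (2/7)
```

The "+ 1" terms keep a feature word that never occurs in the class sample set from forcing the whole product to zero.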

Wherein, the feature lexicon of the mail class is obtained as follows:

For each mail class, segmenting the sample mails in the mail data sample set of the mail class into words, and counting the number of times each resulting word occurs in the mail data sample set of the mail class as the word frequency of that word; after removing rare words and stop words from the resulting words, determining the words whose word frequency is greater than a set lower threshold and less than a set upper threshold as candidate words of the mail class; and determining the candidate words whose part of speech matches a part of speech recorded in a part-of-speech information table as the feature words of the mail class, the feature words of the mail class forming the feature lexicon of the mail class;

Wherein, the mail data sample sets of the mail classes are obtained by dividing the sample mails with a clustering algorithm according to the similarity between the feature vectors of the sample mails.

Preferably, the feature words of the mail to be classified specifically comprise: title feature words extracted from the mail title of the mail to be classified, and content feature words extracted from the mail content of the mail to be classified; and

Calculating, according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail class specifically comprises:

Calculating, according to the title feature words of the mail to be classified, the probability that the mail title of the mail to be classified belongs to the mail class, and taking this probability as the title probability corresponding to the mail class; and

Calculating, according to the content feature words of the mail to be classified, the probability that the mail content of the mail to be classified belongs to the mail class, and taking this probability as the content probability corresponding to the mail class; and

Sorting the calculated probabilities corresponding to the mail classes, judging whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability, and, if so, classifying the mail to be classified into the mail class corresponding to the maximum probability, specifically comprises:

Sorting the calculated title probabilities corresponding to the mail classes, and, if the title feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum title probability, taking that mail class as the pending mail class for the mail title; and

Sorting the calculated content probabilities corresponding to the mail classes, and, if the content feature words of the mail to be classified include a keyword of the mail class corresponding to the maximum content probability, taking that mail class as the pending mail class for the mail content;

If the pending mail class for the mail title is the same as the pending mail class for the mail content, classifying the mail to be classified into that pending mail class.
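A minimal sketch of the title/content agreement check described above is given here. The function names and the dictionary-based inputs are assumptions. Returning `None` when the two pending classes differ, or when one is missing, is a placeholder, since this passage does not state what happens in that case.

```python
def top_class_with_keyword(probabilities, feature_words, keywords_by_class):
    """Return the highest-probability class if the words contain one of its keywords."""
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    if set(feature_words) & keywords_by_class.get(best, set()):
        return best
    return None

def classify_by_title_and_content(title_probs, title_words,
                                  content_probs, content_words,
                                  keywords_by_class):
    """Classify only when the title and the content agree on the pending class."""
    title_class = top_class_with_keyword(title_probs, title_words, keywords_by_class)
    content_class = top_class_with_keyword(content_probs, content_words, keywords_by_class)
    if title_class is not None and title_class == content_class:
        return title_class
    return None  # Disagreement is not covered by this passage (placeholder).

kw = {"recruitment": {"resume"}, "order": {"invoice"}}
print(classify_by_title_and_content(
    {"recruitment": 0.6, "order": 0.4}, ["resume"],
    {"recruitment": 0.7, "order": 0.3}, ["resume", "salary"], kw))  # recruitment
```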

Preferably, after calculating the difference between the maximum probability and the second-ranked probability and calculating the ratio of this difference to the maximum probability, the method further comprises:

If the ratio is not less than the set rate threshold, determining the mail to be classified to be a dialogue mail;

If the ratio is less than the set rate threshold and the feature words of the mail to be classified do not include any keyword of the mail class corresponding to the second-ranked probability:

Taking the ratio as a first classification probability rate, further calculating the difference between the maximum probability and the third-ranked probability, and taking the ratio of this difference to the maximum probability as a second classification probability rate; and, if the second classification probability rate is less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, classifying the mail to be classified into the mail class corresponding to the third-ranked probability.
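The decision rule built up in the preceding paragraphs (keyword check on the top class, then fallback to the second-ranked and third-ranked classes) can be sketched as below. The function name `classify`, the `DIALOGUE` label, and the `None` result for cases the text leaves open are assumptions. The sketch also assumes the maximum probability is positive, which holds for formula 1 because of its "+ 1" terms.

```python
DIALOGUE = "dialogue"  # hypothetical label for dialogue mail

def classify(probabilities, feature_words, keywords_by_class, rate_threshold):
    """Apply the keyword and classification-probability-rate rules."""
    words = set(feature_words)
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_class, best_prob = ranked[0]
    if words & keywords_by_class.get(best_class, set()):
        return best_class
    if len(ranked) < 2:
        return None
    second_class, second_prob = ranked[1]
    first_rate = (best_prob - second_prob) / best_prob
    if first_rate >= rate_threshold:
        return DIALOGUE
    if words & keywords_by_class.get(second_class, set()):
        return second_class
    if len(ranked) < 3:
        return None
    third_class, third_prob = ranked[2]
    second_rate = (best_prob - third_prob) / best_prob
    if second_rate < rate_threshold and words & keywords_by_class.get(third_class, set()):
        return third_class
    return None  # Outcome not specified by this passage (placeholder).

kw = {"recruitment": {"resume"}, "training": {"course"}, "social": {"friend"}}
probs = {"recruitment": 0.5, "training": 0.45, "social": 0.1}
print(classify(probs, ["course", "fee"], kw, 0.2))  # training
```

In the example the top class "recruitment" has no keyword in the mail, the first rate is (0.5 - 0.45) / 0.5 = 0.1, which is below the threshold 0.2, and the mail contains the "training" keyword "course".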

According to another aspect of the present invention, a mail classification device is also provided, comprising:

A probability calculation module, configured to calculate, for each predetermined mail class and according to the feature words of a mail to be classified, the probability that the mail to be classified belongs to the mail class, and to take the calculated probability as the probability corresponding to the mail class;

A sorting module, configured to sort the calculated probabilities corresponding to the mail classes to obtain a sorting result;

A class division module, configured to judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability in the sorting result; if so, to classify the mail to be classified into the mail class corresponding to the maximum probability; otherwise, to calculate the difference between the maximum probability and the second-ranked probability in the sorting result and then the ratio of this difference to the maximum probability; and, if the calculated ratio is less than a set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, to classify the mail to be classified into the mail class corresponding to the second-ranked probability.

Further, the mail classification device also comprises:

A feature word occurrence ratio pre-judgment module, configured to determine, for each predetermined mail class, the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail class, and to calculate the ratio of this number to the total number of feature words of the mail to be classified as the feature word occurrence ratio of the mail to be classified under the mail class; and to trigger the probability calculation module when the feature word occurrence ratio of the mail to be classified under the mail class is confirmed to be greater than a set ratio threshold.

Preferably, the class division module is further configured to determine the mail to be classified to be a dialogue mail if the ratio is not less than the set rate threshold; and, if the ratio is less than the set rate threshold and the feature words of the mail to be classified do not include any keyword of the mail class corresponding to the second-ranked probability: to take the ratio as a first classification probability rate, to further calculate the difference between the maximum probability and the third-ranked probability in the sorting result, and to take the ratio of this difference to the maximum probability as a second classification probability rate; and, when the second classification probability rate is less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, to classify the mail to be classified into the mail class corresponding to the third-ranked probability.

In the technical solution of the present invention, keywords are set for each mail class, and the probability that the mail to be classified belongs to each mail class is combined with the keywords of the mail classes to classify the mail, which prevents non-keywords in the mail to be classified from impairing classification accuracy. In addition, through the calculation of the classification probability rate of the mail to be classified, mail classification retains high accuracy even when the mail cannot be classified into the mail class corresponding to the maximum probability.

Further, calculating the feature word occurrence ratio of the mail to be classified under each mail class simplifies the calculation in the mail classification process while ensuring classification accuracy; and classifying mail according to the mail title and the mail content separately further ensures classification accuracy.

Brief description of the drawings

Fig. 1 is a flowchart of the method for determining the mail classes and their mail data sample sets and feature lexicons according to an embodiment of the present invention;

Fig. 2a and Fig. 2b are flowcharts of the mail classification method according to an embodiment of the present invention;

Fig. 3 is a block diagram of the internal structure of the mail classification device according to an embodiment of the present invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that many details listed in the specification are provided only to give the reader a thorough understanding of one or more aspects of the present invention; these aspects can also be implemented without these specific details.

Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within a process and/or thread of execution, and a module may be located on one computer and/or distributed between two or more computers.

The inventor found that the prior-art method misclassifies mail for the following reason: when the mail content of a mail contains many of the less representative features of a certain mail class, the calculated probability that the mail belongs to that mail class may be the maximum, yet classifying the mail into that mail class may be inaccurate. For example, if a dialogue mail between two friends asks about each other's work, so that the mail content contains words such as "benefits", "pay", and "position", and these words happen to be features of recruitment mail, the prior-art method may mistakenly classify the mail as recruitment mail.

In view of this, a classification rule can be set in advance for each mail class, that is, some more representative words are set as the keywords of the mail class. For example, one or more of the words "work", "resume", and "recruitment" are set as the keywords of the recruitment class. In this way, after the probabilities that the mail to be classified belongs to each mail class are obtained and the mail class corresponding to the maximum probability is determined, it is first judged whether the feature words of the mail to be classified include a keyword of that mail class. If not, the mail to be classified does not satisfy the classification rule of that mail class, and whether to classify it into the mail class corresponding to the second-ranked probability can be determined according to the relative difference between the top two probabilities (referred to herein as the classification probability rate) and the keywords of the mail class corresponding to the second-ranked probability. Thus, based on the keywords of the mail classes and the classification probability rate, the mail to be classified can be classified more accurately.

The technical solution of the present invention is described in detail below with reference to the accompanying drawings. In the embodiment of the present invention, before mail classification is performed, several mail classes (such as merchant news, social, bank card, recruitment information, order information, registration information, and news) and the mail data sample set and feature lexicon of each mail class can be determined in advance, so that the mail to be classified is classified on the basis of the predetermined mail classes. Specifically, the flow of the method for predetermining the mail classes and the mail data sample set and feature lexicon of each mail class, as shown in Fig. 1, includes the following steps:

S101: For each sample mail in the mail set to be trained, obtain the word set of the sample mail; determine the word set of the mail set to be trained from the word sets of the sample mails; and then determine the word feature vector of each sample mail.

Specifically, sample mails that are not dialogue mails, either received within a set time period or up to a set quantity, can be extracted from the inboxes on the mail server, and these sample mails form the mail set to be trained. For each sample mail in the mail set to be trained, the mail data of the sample mail (including the mail title and the mail content) is segmented into words, and stop words and rare words are removed from the resulting words to obtain the word set of the sample mail. The word sets of all sample mails in the mail set to be trained are merged into a single word set, and words that are duplicated across the word sets of the sample mails are removed, to obtain the word set of the mail set to be trained.

For each sample mail in the mail set to be trained, the total number of words in the word set of the mail set to be trained is taken as the dimension of the word feature vector of the sample mail, and each word in the word set of the mail set to be trained corresponds to one vector element of the word feature vector. The value of each vector element is determined as follows: if the word in the word set of the mail set to be trained that corresponds to the vector element is included in the word set of the sample mail, the vector element is set to 1; otherwise, it is set to 0. For example, the word feature vector of a sample mail in the mail set to be trained is expressed as D = [d_{1}, \ldots, d_{j}, \ldots, d_{L}], where d_{j} is 1 or 0: 1 indicates that the j-th word in the word set of the mail set to be trained is included in the word set of the current sample mail, and 0 indicates that it is not. Here, j is a natural number from 1 to L, and L is the total number of words in the word set of the mail set to be trained.
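The construction of the merged word set and the 0/1 word feature vectors can be sketched as follows. The function names are assumptions, and so is the alphabetical ordering of the vocabulary: the text only requires that each word map to one fixed vector position.

```python
def build_vocabulary(mail_word_sets):
    """Merge the word sets of all sample mails, dropping duplicates."""
    # Sorting fixes the position j of each word (the ordering is an assumption).
    return sorted(set().union(*mail_word_sets))

def word_feature_vector(mail_words, vocabulary):
    """d_j = 1 if the j-th vocabulary word occurs in the mail, else 0."""
    mail_words = set(mail_words)
    return [1 if word in mail_words else 0 for word in vocabulary]

word_sets = [{"job", "resume"}, {"order", "job"}]
vocab = build_vocabulary(word_sets)
print(vocab)                                     # ['job', 'order', 'resume']
print(word_feature_vector(word_sets[0], vocab))  # [1, 0, 1]
print(word_feature_vector(word_sets[1], vocab))  # [1, 1, 0]
```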

S102: According to the similarity between the word feature vectors of the sample mails in the mail set to be trained, cluster the sample mails in the mail set to be trained with a clustering algorithm to obtain several clusters.

Specifically, the cosine similarity method can generally be used to calculate the similarity between the word feature vectors of any two sample mails, that is, the similarity between the two sample mails. For example, if the word feature vectors of sample mail x and sample mail y are X = [x_{1}, \ldots, x_{j}, \ldots, x_{L}] and Y = [y_{1}, \ldots, y_{j}, \ldots, y_{L}] respectively, the similarity Sim(X, Y) between the word feature vectors of sample mail x and sample mail y can be calculated according to the following formula 2:

Sim (X, Y) = \frac{\sum_{j=1}^{L} x_{j} \cdot y_{j}}{\sqrt{\sum_{j=1}^{L} x_{j}^{2}} \times \sqrt{\sum_{j=1}^{L} y_{j}^{2}}} (formula 2)

In this step, a similarity matrix can thus be built from the similarities between the word feature vectors of the sample mails in the mail set to be trained, and the sample mails can be clustered with a clustering algorithm (such as a hierarchical clustering algorithm) to obtain several clusters that satisfy a preset clustering termination condition. For example, the termination condition can be that the maximum similarity between clusters reaches a set similarity threshold, or that the number of sample mails in a cluster reaches a set value. Building a similarity matrix and clustering with a clustering algorithm are well known to those skilled in the art and are not described further here.
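Formula 2 and the similarity matrix can be illustrated with the short sketch below. The function names are assumptions, and returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies. The clustering step itself is not shown.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors (formula 2)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # Assumed convention for a mail with no vocabulary words.
    return dot / (norm_x * norm_y)

def similarity_matrix(vectors):
    """Pairwise similarities between all sample mail vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

vectors = [[1, 0, 1], [1, 1, 0], [0, 0, 0]]
matrix = similarity_matrix(vectors)
print(matrix[0][1])  # approximately 0.5
```

For the binary vectors used here, the numerator is simply the number of words the two sample mails share.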

S103: For each cluster obtained by clustering, place the sample mails contained in the cluster into the same mail class, and form the mail data sample set of each mail class from the sample mails of that mail class.

S104: For the mail data sample set of each mail class, extract the feature words of the sample mails in the mail data sample set of the mail class, and thereby obtain the feature lexicon of the mail class.

In this step, for the mail data sample set of each mail class obtained in step S103, the feature words of the sample mails in the mail data sample set are extracted as follows: each sample mail in the mail data sample set of the mail class is segmented into words, and the number of times each resulting word occurs in the mail data sample set of the mail class is counted as the word frequency of that word; after rare words and stop words are removed from the resulting words, the words whose word frequency is greater than a set lower threshold and less than a set upper threshold are determined as candidate words of the mail class; and the candidate words whose part of speech matches a part of speech recorded in a part-of-speech information table are determined as the feature words of the mail class, the feature words of the mail class forming the feature lexicon of the mail class. Here, segmenting a sample mail means segmenting both its mail title and its mail content. The part-of-speech information table records the parts of speech selected to improve classification accuracy, such as idioms with adverbial function, nouns, adjectives, verbs, time words, place morphemes, and measure words, so that personal names, auxiliary words, and the like are filtered out.
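The lexicon construction in step S104 can be sketched as follows. The function name and all parameters are assumptions. In particular, `pos_of` is a hypothetical dictionary standing in for a real part-of-speech tagger, and the mails are assumed to be already segmented into token lists.

```python
from collections import Counter

def build_feature_lexicon(sample_mails, stop_words, rare_words,
                          pos_of, allowed_pos, low, high):
    """Build the feature lexicon of one mail class.

    sample_mails: list of token lists (title and content already segmented).
    pos_of: dict word -> part-of-speech tag (stand-in for a tagger).
    allowed_pos: tags recorded in the part-of-speech information table.
    low, high: set lower and upper word-frequency thresholds.
    """
    freq = Counter(word for tokens in sample_mails for word in tokens)
    lexicon = set()
    for word, count in freq.items():
        if word in stop_words or word in rare_words:
            continue  # Remove stop words and rare words.
        if not (low < count < high):
            continue  # Keep only candidate words within the frequency band.
        if pos_of.get(word) in allowed_pos:
            lexicon.add(word)  # Part of speech matches the table.
    return lexicon

mails = [["the", "job", "job", "quickly"], ["job", "resume", "resume", "the"]]
pos = {"job": "noun", "resume": "noun", "quickly": "adverb", "the": "det"}
result = build_feature_lexicon(mails, {"the"}, set(), pos, {"noun"}, 1, 10)
print(sorted(result))  # ['job', 'resume']
```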

Based on the predetermined mail classes, the flow of the mail classification method provided by the embodiment of the present invention, as shown in Fig. 2a and Fig. 2b, specifically includes the following steps:

S201: For a mail to be classified, extract the feature words from the mail data of the mail to be classified; set i = 1.

Specifically, an existing word segmentation algorithm can be used to segment the mail to be classified into words; after rare words and stop words are removed from the resulting words, the feature words of the mail to be classified are obtained.

To simplify calculation and at the same time improve classification accuracy, after step S201, in steps S202 to S205, the feature word occurrence ratio of the mail to be classified under each predetermined mail class can be calculated, and the probability that the mail to be classified belongs to a mail class is calculated only after the feature word occurrence ratio of the mail to be classified under that mail class is determined to be greater than the set ratio threshold.

S202: For the predetermined i-th mail class, calculate the feature word occurrence ratio of the mail to be classified under the i-th mail class.

Specifically, for the predetermined i-th mail class, the number of feature words of the mail to be classified that are contained in the feature lexicon of the i-th mail class is determined, the ratio of this number to the total number of feature words of the mail to be classified is calculated, and the calculated ratio is taken as the feature word occurrence ratio of the mail to be classified under the i-th mail class. The feature word occurrence ratio of the mail to be classified under a mail class reflects how many of its feature words appear in the feature lexicon of that mail class, and therefore reflects how likely the mail is to belong to that mail class: if the ratio is small, the probability that the mail to be classified belongs to the mail class is small; otherwise, the probability is large. Here, i is a natural number from 1 to m, and m is the number of predetermined mail classes.
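The feature word occurrence ratio of step S202 amounts to a single division, sketched below. The function name is an assumption, as is treating the feature words as a list in which repeated words are counted each time; the ratio is taken as 0.0 for a mail with no feature words.

```python
def occurrence_ratio(mail_feature_words, class_lexicon):
    """Share of the mail's feature words found in the class feature lexicon."""
    if not mail_feature_words:
        return 0.0  # Assumed convention for a mail without feature words.
    hits = sum(1 for word in mail_feature_words if word in class_lexicon)
    return hits / len(mail_feature_words)

ratio = occurrence_ratio(["job", "salary", "hello", "weather"],
                         {"job", "salary", "resume"})
print(ratio)  # 0.5
```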

S203: Compare the feature word occurrence ratio of the mail to be classified under the i-th mail class with the set ratio threshold, and judge whether the ratio is greater than the set ratio threshold; if so, perform step S204; otherwise, jump to step S205.

That is, it is judged whether the feature word occurrence ratio of the mail to be classified under the i-th mail class is greater than the set ratio threshold. If so, the probability that the mail to be classified belongs to the i-th mail class is calculated; otherwise, it is directly judged whether i equals m, and the probability that the mail to be classified belongs to the i-th mail class is not calculated, which simplifies the calculation while maintaining classification accuracy.

S204: According to the feature words of the mail to be classified, calculate the probability that the mail to be classified belongs to the i-th mail class, and take the calculated probability as the probability corresponding to the i-th mail class.

If the comparison result is that the feature word occurrence ratio of the mail to be classified under the i-th mail class is greater than the set ratio threshold, the probability that the mail to be classified belongs to this mail class is calculated.

Specifically, based on the existing Naive Bayes algorithm, the feature words of the mail to be classified are assumed to be mutually independent. Denoting the i-th mail class as C_{i} and the n feature words of the mail to be classified as F_{1}, F_{2}, \ldots, F_{n}, the probability that the mail to be classified belongs to the i-th mail class C_{i} can be expressed, based on the Naive Bayes algorithm, as P(C_{i} | F_{1}, F_{2}, \ldots, F_{n}) in formula 3:

P (C_{i} | F_{1}, F_{2}, \ldots, F_{n}) = \frac{P (C_{i}) P (F_{1}, F_{2}, \ldots, F_{n} | C_{i})}{P (F_{1}, F_{2}, \ldots, F_{n})} (formula 3)

Since the feature words of the mail to be classified are mutually independent:

P(F_{1}, F_{2}, \ldots, F_{n} | C_{i}) = P(F_{1} | C_{i}) P(F_{2} | C_{i}) \cdots P(F_{n} | C_{i});

And:

P(F_{1}, F_{2}, \ldots, F_{n}) = P(F_{1}) P(F_{2}) \cdots P(F_{n});

P(F_{1}, F_{2}, \ldots, F_{n}) is the same for every mail class, therefore:

P(C_{i} | F_{1}, F_{2}, \ldots, F_{n}) \propto P(C_{i}) P(F_{1} | C_{i}) P(F_{2} | C_{i}) \cdots P(F_{n} | C_{i});

Calculating P(C_{i} | F_{1}, F_{2}, \ldots, F_{n}) can thus be converted into calculating P(C_{i}) and P(F_{k} | C_{i}). Therefore, the value of the following formula 1 can be calculated and taken as the probability that the mail to be classified belongs to the i-th mail class:

P(C_{i}) P(F_{1} | C_{i}) P(F_{2} | C_{i}) \cdots P(F_{n} | C_{i}) (formula 1)

In formula 1,

P(F_k \mid C_i) = \frac{f_{F_k} + 1}{f_{C_i} + 1}, \quad P(C_i) = \frac{S_{C_i}}{S};
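Formula 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: this passage does not define the symbols, so the sketch assumes that f_{F_k} is the number of occurrences of F_k in the sample set of class C_i, that f_{C_i} is the total number of feature-word occurrences in that sample set, that S_{C_i} is the number of sample mails in the class, and that S is the number of all sample mails. The function and parameter names are hypothetical.

```python
def naive_bayes_score(feature_words, class_word_counts,
                      class_sample_count, total_sample_count):
    """Evaluate formula 1: P(C_i) * P(F_1|C_i) * ... * P(F_n|C_i).

    Assumed reading of the symbols (not defined in this passage):
      class_word_counts[w] ~ f_{F_k}, occurrences of word w in the class samples
      sum of those counts  ~ f_{C_i}
      class_sample_count   ~ S_{C_i}
      total_sample_count   ~ S
    """
    f_ci = sum(class_word_counts.values())
    score = class_sample_count / total_sample_count  # P(C_i) = S_{C_i} / S
    for word in feature_words:
        f_fk = class_word_counts.get(word, 0)
        score *= (f_fk + 1) / (f_ci + 1)  # P(F_k | C_i) = (f_{F_k}+1)/(f_{C_i}+1)
    return score
```

For long mails the product can underflow; summing logarithms of the factors is a common alternative, although the text itself only describes the product form.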

Moreover, when the probability that the mail to be classified belongs to the i-th mail class C_i is calculated with the above naive Bayes algorithm and the mail is then classified, using the feature-word occurrence ratio as a judgment factor prevents a single feature word that occurs very often in the mail data sample set of mail class C_i from distorting the determination of the mail class. For example, suppose feature word F_1 occurs a very large number of times in the mail data sample set of mail class C_i, while the other feature words hardly occur in that sample set. P(F_1 | C_i) may then be large enough to make the product P(C_i) P(F_1 | C_i) P(F_2 | C_i) ... P(F_n | C_i) large, so that the classification of the mail is not accurate enough. Using the feature-word occurrence ratio as a judgment factor avoids this situation well.

In addition, after the word feature vector of the mail to be classified is obtained, each element of this vector may be normalized, and the similarity between the feature vector of the mail to be classified and the feature vector of each sample mail in the i-th mail class C_i may be calculated. The mean of these similarities is then calculated and taken as the probability that the mail to be classified belongs to the i-th mail class.
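The similarity-based alternative can be sketched as follows. The text does not say which similarity measure is used, so this sketch assumes L2 normalization and cosine similarity. The function names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (assumed normalization; zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def mean_similarity(mail_vec, sample_vecs):
    """Mean cosine similarity between the mail vector and each sample-mail vector of one class."""
    if not sample_vecs:
        return 0.0
    m = normalize(mail_vec)
    sims = [sum(a * b for a, b in zip(m, normalize(s))) for s in sample_vecs]
    return sum(sims) / len(sims)
```

The returned mean plays the role of the probability for the i-th mail class in the subsequent sorting step.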

S205: judge whether i equals m; if so, perform step S206; otherwise, set i = i+1 and jump to step S202.

Specifically, m is the number of predetermined mail classes. If i = m, the feature-word occurrence ratio of the mail to be classified has been calculated under every mail class. If i ≠ m, set i = i+1 and jump to step S202 to calculate the feature-word occurrence ratio of the mail to be classified under the next, (i+1)-th, mail class.
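The loop over the m mail classes, with the occurrence-ratio pre-judgment gating the probability calculation, can be sketched as below. The ratio follows the definition given for the feature-word occurrence ratio pre-judgment module (feature words of the mail found in the class's feature lexicon, divided by the total number of feature words of the mail). The function names, the dictionary layout, and the `score_fn` callback are assumptions made for illustration.

```python
def occurrence_ratio(feature_words, class_lexicon):
    """Share of the mail's feature words that appear in the class's feature lexicon."""
    if not feature_words:
        return 0.0
    hits = sum(1 for w in feature_words if w in class_lexicon)
    return hits / len(feature_words)

def score_classes(feature_words, lexicons, ratio_threshold, score_fn):
    """Visit every class (i = 1..m) and score only those that pass the ratio check.

    lexicons: dict mapping class name -> set of feature words (hypothetical layout)
    score_fn: callable (class_name, feature_words) -> probability, e.g. formula 1
    """
    probabilities = {}
    for name, lexicon in lexicons.items():
        if occurrence_ratio(feature_words, lexicon) > ratio_threshold:
            probabilities[name] = score_fn(name, feature_words)
        # otherwise the class is skipped and no probability is computed for it
    return probabilities
```

Classes that fail the ratio check do not appear in the result, which matches the stated aim of skipping their probability calculation.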

S206: after sorting the calculated probabilities corresponding to the mail classes, judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability; if so, perform step S207; otherwise, perform step S210.

Specifically, a keyword table is stored in advance for each mail class, and the keywords in this table are normally predetermined. They can be determined as follows: for each mail class, and for each feature word in the feature lexicon of this mail class, count in advance the number of sample mails in this mail class that contain the feature word, and sort the feature words in descending order of this count; the top set number of feature words in the ranking are taken as the keywords of this mail class. Alternatively, the keywords of each mail class can be set by those skilled in the art on the basis of experience. For example, "work", "resume" and "recruitment" are set as keywords of recruitment mail.
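The count-and-rank procedure for building a keyword table can be sketched as below. The input layout (one collection of feature words per sample mail) and the alphabetical tie-break are assumptions; the text only specifies a descending sort by the number of sample mails containing each feature word.

```python
from collections import Counter

def build_keywords(sample_mails, lexicon, top_n):
    """Pick the top_n lexicon words by the number of sample mails that contain them.

    sample_mails: iterable of word collections, one per sample mail of the class
    lexicon: the class's feature lexicon
    """
    lexicon = set(lexicon)
    mail_counts = Counter()
    for words in sample_mails:
        for w in set(words) & lexicon:  # count each word at most once per mail
            mail_counts[w] += 1
    # descending by mail count; ties broken alphabetically (an assumption)
    ranked = sorted(mail_counts, key=lambda w: (-mail_counts[w], w))
    return ranked[:top_n]
```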

In this step, the calculated probabilities corresponding to the mail classes are sorted in descending order, and it is judged whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability, that is, whether the mail to be classified satisfies the classification rule of that mail class. If the feature words of the mail to be classified include one or more keywords of the mail class corresponding to the maximum probability, the mail satisfies the classification rule of that mail class and can be classified directly into it. If the feature words of the mail to be classified include no keyword of the mail class corresponding to the maximum probability, classifying the mail into that mail class would not be accurate enough, and the mail can be processed according to the following steps S210 to S216.

S207: classify the mail to be classified into the mail class corresponding to the maximum probability.

If it is determined that the feature words of the mail to be classified include one or more keywords of the mail class corresponding to the maximum probability, the mail to be classified is classified into that mail class.

S210: set h = 2, where h is a natural number greater than or equal to 2 and less than or equal to m.

S211: calculate the difference between the maximum probability and the probability ranked h, then calculate the ratio of this difference to the maximum probability, and take this ratio as the (h-1)-th classification probability rate.

For example, when h = 2, the difference between the maximum probability and the second-ranked probability is calculated, and the ratio of this difference to the maximum probability is taken as the first classification probability rate. That is, if the maximum probability is P_1 and the second-ranked probability is P_2, the first classification probability rate of the mail to be classified is P_d1 = (P_1 - P_2) / P_1.

As another example, if the third-ranked probability is P_3, the second classification probability rate is P_d2 = (P_1 - P_3) / P_1.
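As a worked illustration with hypothetical values (the probabilities and the threshold of 0.2 below are assumptions chosen only for the arithmetic, not figures from the invention), take P_1 = 0.50, P_2 = 0.45 and P_3 = 0.20:

```latex
P_{d1} = \frac{P_1 - P_2}{P_1} = \frac{0.50 - 0.45}{0.50} = 0.10
\qquad
P_{d2} = \frac{P_1 - P_3}{P_1} = \frac{0.50 - 0.20}{0.50} = 0.60
```

With a set rate threshold of 0.2, the first classification probability rate (0.10) is below the threshold, so the second-ranked mail class remains a candidate. The second classification probability rate (0.60) is not below it.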

S212: judge whether the (h-1)-th classification probability rate is less than the set rate threshold; if so, perform step S213; otherwise, perform step S216.

The set rate threshold can be set by those skilled in the art according to the actual mail classification situation.

S213: judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the probability ranked h; if so, perform step S214; otherwise, perform step S215.

For example, when h = 2, if the first classification probability rate is less than the set rate threshold, it is judged whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability.

S214: classify the mail to be classified into the mail class corresponding to the probability ranked h.

If step S212 judges that the (h-1)-th classification probability rate of the mail to be classified is less than the set rate threshold, and step S213 judges that the feature words of the mail include at least one keyword of the mail class corresponding to the probability ranked h, the mail to be classified is classified into that mail class.

For example, when h = 2, if the first classification probability rate is less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, the mail is classified into the mail class corresponding to the second-ranked probability.

As another example, when h = 3, if the second classification probability rate is less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, the mail is classified into the mail class corresponding to the third-ranked probability.

S215: judge whether h equals m; if so, perform step S216; otherwise, set h = h+1 and jump to step S211.

Specifically, if the (h-1)-th classification probability rate is less than the set rate threshold but the feature words of the mail to be classified include no keyword of the mail class corresponding to the probability ranked h, the difference between the maximum probability and the probability ranked h+1 is calculated, the ratio of this difference to the maximum probability is taken as the h-th classification probability rate, and the mail is classified according to the h-th classification probability rate.

S216: determine the mail to be classified to be dialogue mail.

If step S212 judges that the (h-1)-th classification probability rate of the mail to be classified is not less than (that is, greater than or equal to) the set rate threshold, the mail to be classified is determined in this step to be ordinary dialogue mail. For example, when h = 2, if the first classification probability rate is not less than the set rate threshold, the mail to be classified is determined to be dialogue mail.

Alternatively, after step S215 judges that h = m, the mail to be classified is determined in this step to be ordinary dialogue mail.
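Steps S206 to S216 can be summarized in one function. This is a sketch under stated assumptions: the dictionary layouts and names are hypothetical, and the handling of an empty probability table or a non-positive maximum probability (both returned as dialogue mail) is not covered by the text and is assumed here.

```python
DIALOGUE = "dialogue"  # hypothetical label for ordinary dialogue mail

def classify(probabilities, feature_words, keywords, rate_threshold):
    """probabilities: class -> probability; keywords: class -> set of keywords."""
    if not probabilities:
        return DIALOGUE  # assumption: no class was scored
    # S206: sort the per-class probabilities in descending order
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    words = set(feature_words)
    top_class, p_max = ranked[0]
    if words & keywords.get(top_class, set()):
        return top_class                      # S207
    if p_max <= 0:
        return DIALOGUE  # assumption: avoid dividing by a zero maximum
    for cls, p_h in ranked[1:]:               # S210/S215: h = 2 .. m
        rate = (p_max - p_h) / p_max          # S211: (h-1)-th classification probability rate
        if rate >= rate_threshold:            # S212 fails -> S216
            return DIALOGUE
        if words & keywords.get(cls, set()):  # S213 -> S214
            return cls
    return DIALOGUE                           # S215 reached h = m -> S216
```

Because the probabilities are sorted in descending order, the classification probability rate never decreases as h grows, so the loop stops at the first rank whose rate reaches the threshold.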

More preferably, the present invention can also perform word segmentation separately on the mail title and the mail content of the mail to be classified, extracting title feature words from the mail title and content feature words from the mail content; in other words, the feature words of the mail to be classified specifically comprise title feature words and content feature words. After the title feature words and content feature words are extracted, then for each mail class, the probability that the mail title belongs to this mail class is calculated from the title feature words and taken as the title probability corresponding to this mail class, and the probability that the mail content belongs to this mail class is calculated from the content feature words and taken as the content probability corresponding to this mail class. The calculated title probabilities corresponding to the mail classes are then sorted; if the title feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum title probability, that mail class is taken as the pending mail class for the mail title. Likewise, the calculated content probabilities corresponding to the mail classes are sorted; if the content feature words of the mail to be classified include a keyword of the mail class corresponding to the maximum content probability, that mail class is taken as the pending mail class for the mail content. If the pending mail class for the mail title is the same as the pending mail class for the mail content, the mail to be classified is classified into that pending mail class; otherwise, the mail to be classified is classified as dialogue mail. In this way, if the error probability of a single mail classification is P_e, the error probability of the method that classifies on the mail title and the mail content separately and requires the two results to agree is P_e^2 (treating the two judgments as independent), so the classification error rate is reduced, that is, the accuracy of mail classification is improved.
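The title/content agreement rule can be sketched as follows. The names and dictionary layouts are hypothetical, and sharing one keyword table between title and content is an assumption; the text does not say whether separate tables are kept.

```python
def pending_class(probabilities, feature_words, keywords):
    """Return the top-probability class if the words contain one of its keywords, else None."""
    if not probabilities:
        return None
    top = max(probabilities, key=probabilities.get)
    return top if set(feature_words) & keywords.get(top, set()) else None

def classify_title_and_content(title_probs, title_words,
                               content_probs, content_words, keywords):
    """Classify only when the title and the content point to the same pending class."""
    t = pending_class(title_probs, title_words, keywords)
    c = pending_class(content_probs, content_words, keywords)
    return t if t is not None and t == c else "dialogue"
```

Requiring agreement trades coverage for precision: a mail is assigned to a class only when both judgments select the same one, and otherwise falls back to dialogue mail.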

More preferably, since a given sender usually sends mail of one or a few mail classes, the present invention can also record the senders of the sample mails of each mail class. When a mail to be classified is received, the mail classes of the sample mails previously sent by its sender are determined, and the probability that the mail to be classified belongs to each of these mail classes is calculated directly. The maximum probability that is greater than a set probability threshold is determined, and the mail to be classified is classified into the mail class corresponding to this probability. Mail classification can thus be performed on the basis of the sender, which simplifies the calculation.
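The sender-based shortcut can be sketched as below. The names, the history layout, and the `score_fn` callback are hypothetical. Returning `None` when the sender is unknown or when no probability exceeds the threshold is an assumption, since the text does not state what happens in those cases; a caller could then fall back to the full procedure.

```python
def classify_by_sender(sender, sender_history, feature_words,
                       score_fn, probability_threshold):
    """Score only the classes this sender has sent before.

    sender_history: dict mapping sender -> set of class names seen in sample mails
    score_fn: callable (class_name, feature_words) -> probability
    """
    candidates = sender_history.get(sender, set())
    scored = {c: score_fn(c, feature_words) for c in candidates}
    if not scored:
        return None  # assumption: unknown sender, no shortcut available
    best = max(scored, key=scored.get)
    # accept the maximum probability only if it exceeds the set probability threshold
    return best if scored[best] > probability_threshold else None
```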

Based on the above mail classification method, the internal structure of the mail classification device of the embodiment of the present invention is shown in Figure 3 and specifically comprises: a probability calculation module 301, a category division module 302 and a sorting module 304.

The probability calculation module 301 is used to calculate, for each predetermined mail class and according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to this mail class, and to take the calculated probability as the probability corresponding to this mail class.

The sorting module 304 is used to sort the calculated probabilities corresponding to the mail classes and obtain a ranking result.

The category division module 302 is used to judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability in the ranking result. If so, the mail to be classified is classified into the mail class corresponding to the maximum probability. Otherwise, the difference between the maximum probability and the second-ranked probability in the ranking result is calculated, and then the ratio of this difference to the maximum probability is calculated; if the calculated ratio is judged to be less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, the mail to be classified is classified into the mail class corresponding to the second-ranked probability.

Further, the category division module 302 is also used to determine the mail to be classified to be dialogue mail if the calculated ratio is judged to be not less than the set rate threshold. If the calculated ratio is judged to be less than the set rate threshold and the feature words of the mail to be classified include no keyword of the mail class corresponding to the second-ranked probability, then: the calculated ratio is taken as the first classification probability rate, the difference between the maximum probability and the third-ranked probability in the ranking result is further calculated, and the ratio of this difference to the maximum probability is taken as the second classification probability rate; if the second classification probability rate is determined to be less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, the mail to be classified is classified into the mail class corresponding to the third-ranked probability.

Further, the above mail classification device can also comprise a feature-word occurrence ratio pre-judgment module 303.

The feature-word occurrence ratio pre-judgment module 303 is used, for each predetermined mail class, to determine the number of feature words of the mail to be classified that are contained in the feature lexicon of this mail class, and to calculate the ratio of this number to the total number of feature words of the mail to be classified as the feature-word occurrence ratio of the mail to be classified under this mail class. When it confirms that the feature-word occurrence ratio of the mail to be classified under this mail class is greater than the set ratio threshold, it triggers the probability calculation module 301. Correspondingly, the probability calculation module 301 calculates, according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to this mail class, and takes the calculated probability as the probability corresponding to this mail class.

For the functions implemented by each module of the mail classification device, reference can be made to the steps of the mail classification method shown in Figs. 2a and 2b above.

The above is only the preferred embodiment of the present invention. It should be pointed out that those skilled in the art can make improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

1. A mail classification method, characterized by comprising:

2. The method of claim 1, characterized in that, before calculating the probability that the mail to be classified belongs to this mail class, the method further comprises:

3. The method of claim 2, characterized in that the keywords of the mail class are predetermined:

4. The method of claim 3, characterized in that calculating, for each predetermined mail class and according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to this mail class specifically comprises:

P(C_i)\, P(F_1 \mid C_i)\, P(F_2 \mid C_i) \cdots P(F_n \mid C_i) (formula 1)

In formula 1,

P(F_k \mid C_i) = \frac{f_{F_k} + 1}{f_{C_i} + 1}, \quad P(C_i) = \frac{S_{C_i}}{S};

5. The method of claim 4, characterized in that the feature lexicon of the mail class is obtained according to the following method:

6. The method of claim 4 or 5, characterized in that the feature words of the mail to be classified specifically comprise: title feature words extracted from the mail title of the mail to be classified, and content feature words extracted from the mail content of the mail to be classified; and

7. The method of any one of claims 1-5, characterized in that, after calculating the difference between the maximum probability and the second-ranked probability and calculating the ratio of this difference to the maximum probability, the method further comprises:

8. A mail classification device, characterized by comprising:

a category division module, used to judge whether the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the maximum probability in the ranking result; if so, to classify the mail to be classified into the mail class corresponding to the maximum probability; otherwise: to calculate the difference between the maximum probability and the second-ranked probability in the ranking result, and to calculate the ratio of this difference to the maximum probability; and, if the calculated ratio is judged to be less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the second-ranked probability, to classify the mail to be classified into the mail class corresponding to the second-ranked probability.

9. The device of claim 8, characterized by further comprising:

10. The device of claim 8 or 9, characterized in that,

the category division module is also used to determine the mail to be classified to be dialogue mail if the ratio is judged to be not less than the set rate threshold; and, if the ratio is judged to be less than the set rate threshold and the feature words of the mail to be classified include no keyword of the mail class corresponding to the second-ranked probability, then: to take the ratio as the first classification probability rate, to further calculate the difference between the maximum probability and the third-ranked probability in the ranking result, and to take the ratio of this difference to the maximum probability as the second classification probability rate; and, if the second classification probability rate is determined to be less than the set rate threshold and the feature words of the mail to be classified include at least one keyword of the mail class corresponding to the third-ranked probability, to classify the mail to be classified into the mail class corresponding to the third-ranked probability.