CN103984703B - Mail classification method and device - Google Patents
- Publication number: CN103984703B
- Application number: CN201410163082.1A
- Authority: CN (China)
- Prior art keywords: probability, classes, sorted, maximum
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/107—Computer-aided management of electronic mailing [e-mailing]
Abstract
The invention discloses a mail classification method and device. The method comprises: for each mail category, calculating the probability that a mail to be classified belongs to that category, and taking the calculated probability as the probability corresponding to that category; sorting the calculated probabilities corresponding to the mail categories, and judging whether the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the maximum probability; if so, classifying the mail into the mail category corresponding to the maximum probability; otherwise, calculating the difference between the maximum probability and the second-ranked probability, and the ratio of that difference to the maximum probability; and, if the ratio is less than a set ratio threshold and the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the second-ranked probability, classifying the mail into the mail category corresponding to the second-ranked probability. The keywords set for each mail category thus make mail classification more accurate.
Description
Technical field
The present invention relates to the Internet field, and in particular to a mail classification method and device.
Background
Email transmits information over a network in a store-and-forward manner and is characterized by fast delivery, a wide range of recipients, and low cost. In the current Internet information age, it is increasingly common for people to exchange information or communicate by email.
Generally, the mailbox of an email user contains many types of mail, such as business news, social, order, recruitment, training organization, and bank financing mail, as well as ordinary dialogue mail (for example, greetings exchanged between friends). If the user's inbox receives too much promotional mail such as business news, users complain; and if mail is delivered to the inbox indiscriminately, mail of all types is mixed together there, which makes it hard for the user to find and read the mail he or she needs. Mail systems therefore often classify mail into several categories so that the user has a better mailbox experience. For example, besides the ordinary inbox, Gmail has categories such as advertising mail and website update mail, and QQ Mail has subscription mail.
At present, one existing mail classification method is mainly based on a clustering algorithm: according to the feature words obtained by segmenting the mail data of training sample mails, the training sample mails are divided into several mail categories, each of which forms its own mail data sample set. Then, according to the feature words of the mail data of the mail to be classified, the probability that the mail belongs to the mail data sample set of each mail category is calculated; the mail category corresponding to the maximum probability is taken as the category of the mail, and the mail is placed in the mail data sample set of that category. Here, the mail data is usually the mail content.
However, the inventors of the present invention found that the accuracy of the prior-art mail classification method is relatively low: some mails are misclassified, so the user cannot view the required mail in time. For example, a user who is looking for a job may care most about recruitment mail, yet the prior-art method may place recruitment mail in the training organization category, so that the user cannot obtain the recruitment information in time. As another example, if ordinary dialogue mail is classified as business news mail, the user may be unable to check the misclassified dialogue mail in time, which causes great inconvenience. It is therefore necessary to provide a mail classification method that classifies mail more accurately.
Summary of the invention
In view of the above defects of the prior art, the present invention provides a mail classification method and device to improve the accuracy of mail classification.
According to one aspect of the present invention, a mail classification method is provided, including:
For each predetermined mail category, calculating, according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail category, and taking the calculated probability as the probability corresponding to the mail category;
Sorting the calculated probabilities corresponding to the mail categories, and judging whether the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the maximum probability; if so, classifying the mail to be classified into the mail category corresponding to the maximum probability; otherwise:
Calculating the difference between the maximum probability and the second-ranked probability, and calculating the ratio of the difference to the maximum probability; if the calculated ratio is less than a set ratio threshold and the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the second-ranked probability, classifying the mail to be classified into the mail category corresponding to the second-ranked probability.
Preferably, before calculating the probability that the mail to be classified belongs to the mail category, the method further includes:
Determining the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail category, and calculating the ratio of that number to the total number of feature words of the mail to be classified as the feature word occurrence ratio of the mail to be classified under the mail category; and confirming that the feature word occurrence ratio of the mail to be classified under the mail category is greater than a set ratio threshold.
The keywords of a mail category are predetermined as follows:
For each mail category, and for each feature word in the feature lexicon of the mail category, counting in advance the number of sample mails in the mail category that contain the feature word, and sorting the feature words by that number in descending order; the set number of top-ranked feature words are taken as the keywords of the mail category.
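The keyword selection described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patent's implementation: the function name `select_keywords`, the representation of each sample mail as a list of already segmented words, and the sample data are all hypothetical.

```python
from collections import Counter

def select_keywords(sample_mails, feature_lexicon, top_n):
    """Return the top_n feature words ranked by the number of sample mails containing them.

    sample_mails: list of word lists, one per sample mail of a single category (assumed input).
    feature_lexicon: set of feature words of that category.
    """
    mail_count = Counter()
    for words in sample_mails:
        # Count each feature word at most once per sample mail.
        mail_count.update(set(words) & feature_lexicon)
    # most_common returns entries sorted by count in descending order.
    return [word for word, _ in mail_count.most_common(top_n)]

# Hypothetical sample data for a recruitment category.
recruitment_samples = [
    ["resume", "recruitment", "position"],
    ["resume", "salary"],
    ["resume", "recruitment"],
]
lexicon = {"resume", "recruitment", "position", "salary"}
keywords = select_keywords(recruitment_samples, lexicon, top_n=2)
```

Here "resume" occurs in three sample mails and "recruitment" in two, so these two words are selected when the set number is 2.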
Preferably, for each predetermined mail category, calculating the probability that the mail to be classified belongs to the mail category according to the feature words of the mail to be classified specifically includes:
Denoting the i-th mail category as $C_i$ and the n feature words of the mail to be classified as $F_1, F_2, \ldots, F_n$, calculating the value of the following formula 1 as the probability that the mail to be classified belongs to the i-th mail category:
$P(C_i)\,P(F_1|C_i)\,P(F_2|C_i)\cdots P(F_n|C_i)$ (formula 1)
In formula 1, $P(F_k|C_i) = N_{F_k,C_i} / N_{C_i}$ and $P(C_i) = S_{C_i} / S$,
where k is a natural number from 1 to n; $N_{F_k,C_i}$ is the number of times feature word $F_k$ appears in the mail data sample set of mail category $C_i$; $N_{C_i}$ is the sum of the numbers of times that the feature words in the feature lexicon of mail category $C_i$ appear in the mail data sample set of $C_i$; $S_{C_i}$ is the number of sample mails in the mail data sample set of mail category $C_i$; and $S$ is the total number of sample mails in the mail data sample sets of all mail categories.
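As an illustration, formula 1 can be evaluated as in the following Python sketch. It reads the definitions above as relative-frequency estimates: P(Fk|Ci) is the occurrence count of Fk in the sample set of Ci divided by the total occurrence count of the lexicon's feature words there, and P(Ci) is the share of sample mails that belong to Ci. The function name, parameter names, and counts are hypothetical. No smoothing is applied, because the text describes none, so a feature word that never occurs in the category's sample set makes the whole score zero.

```python
from collections import Counter

def class_score(mail_features, class_word_counts, class_mail_count, total_mail_count):
    """Value of formula 1: P(C_i) multiplied by P(F_k | C_i) for every feature word F_k."""
    prior = class_mail_count / total_mail_count          # P(C_i)
    total_occurrences = sum(class_word_counts.values())  # all lexicon-word occurrences in C_i
    score = prior
    for word in mail_features:
        # P(F_k | C_i): occurrences of this word over all occurrences; 0 if never seen.
        score *= class_word_counts.get(word, 0) / total_occurrences
    return score

# Hypothetical counts for one category: 20 of 100 sample mails, 10 word occurrences.
recruitment_counts = Counter({"resume": 6, "position": 3, "salary": 1})
score = class_score(["resume", "position"], recruitment_counts,
                    class_mail_count=20, total_mail_count=100)
# score is 0.2 * 0.6 * 0.3, up to floating-point rounding
```

For long mails the product of many small factors can underflow; summing logarithms instead of multiplying is a common variation, although the text does not describe it.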
The feature lexicon of a mail category is obtained as follows:
For each mail category, performing word segmentation on the sample mails in the mail data sample set of the mail category, and counting the number of times each resulting word appears in the mail data sample set of the mail category as the word frequency of that word; after removing rare words and stop words from the resulting words, determining the words whose word frequency is greater than a set lower threshold and less than a set upper threshold as candidate words of the mail category; determining the candidate words whose part-of-speech information matches the part-of-speech information recorded in a part-of-speech information table as the feature words of the mail category, the feature words of the mail category forming the feature lexicon of the mail category;
The mail data sample set of each mail category is obtained by dividing the sample mails with a clustering algorithm according to the similarity between the feature vectors of the sample mails.
Preferably, the feature words of the mail to be classified specifically include title feature words extracted from the mail title of the mail to be classified, and content feature words extracted from the mail content of the mail to be classified; and
Calculating the probability that the mail to be classified belongs to the mail category according to the feature words of the mail to be classified specifically includes:
Calculating, according to the title feature words of the mail to be classified, the probability that the mail title of the mail to be classified belongs to the mail category, and taking this probability as the title probability corresponding to the mail category; and
Calculating, according to the content feature words of the mail to be classified, the probability that the mail content of the mail to be classified belongs to the mail category, and taking this probability as the content probability corresponding to the mail category; and
Sorting the calculated probabilities corresponding to the mail categories, judging whether the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the maximum probability, and if so classifying the mail to be classified into the mail category corresponding to the maximum probability, specifically includes:
Sorting the calculated title probabilities corresponding to the mail categories, and, if the title feature words of the mail to be classified include at least one keyword of the mail category corresponding to the maximum title probability, taking the mail category corresponding to the maximum title probability as the pending mail category for the mail title; and
Sorting the calculated content probabilities corresponding to the mail categories, and, if the content feature words of the mail to be classified include a keyword of the mail category corresponding to the maximum content probability, taking the mail category corresponding to the maximum content probability as the pending mail category for the mail content;
If the pending mail category for the mail title is the same as the pending mail category for the mail content, classifying the mail to be classified into that pending mail category.
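The title and content agreement check above can be sketched in Python as follows. This is a hedged illustration only: the function names, the dictionaries that map category names to probabilities and to keyword sets, and the sample values are hypothetical, and the sketch returns `None` for every case this passage does not cover.

```python
def pending_category(probabilities, feature_words, category_keywords):
    """Top-probability category if the feature words contain one of its keywords, else None."""
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    if set(feature_words) & category_keywords.get(best, set()):
        return best
    return None

def classify_by_title_and_content(title_probs, content_probs,
                                  title_words, content_words, category_keywords):
    """Classify only when the title and the content lead to the same pending category."""
    title_category = pending_category(title_probs, title_words, category_keywords)
    content_category = pending_category(content_probs, content_words, category_keywords)
    if title_category is not None and title_category == content_category:
        return title_category
    return None  # cases not covered by this passage are left undecided in this sketch

# Hypothetical keywords and probabilities.
category_keywords = {"recruitment": {"resume", "recruitment"}, "training": {"course"}}
result = classify_by_title_and_content(
    {"recruitment": 0.04, "training": 0.01},    # title probabilities
    {"recruitment": 0.002, "training": 0.001},  # content probabilities
    ["recruitment", "engineer"],                # title feature words
    ["resume", "position", "salary"],           # content feature words
    category_keywords,
)
```

In this example both the title and the content point to the recruitment category and each contains one of its keywords, so the mail is classified as recruitment mail.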
Preferably, after calculating the difference between the maximum probability and the second-ranked probability and calculating the ratio of the difference to the maximum probability, the method further includes:
If the ratio of the difference to the maximum probability is not less than the set ratio threshold, determining the mail to be classified to be dialogue mail;
If the ratio of the difference to the maximum probability is less than the set ratio threshold, and the feature words of the mail to be classified do not include a keyword of the mail category corresponding to the second-ranked probability, then:
Taking the ratio of the difference to the maximum probability as a first category probability ratio, further calculating the difference between the maximum probability and the third-ranked probability, and taking the ratio of this difference to the maximum probability as a second category probability ratio; if the second category probability ratio is less than the set ratio threshold and the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the third-ranked probability, classifying the mail to be classified into the mail category corresponding to the third-ranked probability.
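Taken together, the decision rule (keyword check on the most probable category, then fallback to the second-ranked and third-ranked categories through the category probability ratio) might look like the following Python sketch. The function name, the `"dialogue"` label, the dictionary inputs, and the `None` return value for cases the text leaves open are assumptions made for illustration.

```python
def classify(probabilities, feature_words, category_keywords, ratio_threshold):
    """Keyword-checked classification with fallback to the 2nd- and 3rd-ranked categories."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    words = set(feature_words)

    def has_keyword(category):
        return bool(words & category_keywords.get(category, set()))

    best_category, best_prob = ranked[0]
    if has_keyword(best_category):
        return best_category
    if best_prob <= 0:
        return None  # assumption: the ratio is undefined without a positive maximum

    for rank, (candidate, prob) in enumerate(ranked[1:3], start=2):
        ratio = (best_prob - prob) / best_prob  # category probability ratio
        if rank == 2 and ratio >= ratio_threshold:
            return "dialogue"                   # hypothetical label for dialogue mail
        if ratio < ratio_threshold and has_keyword(candidate):
            return candidate
    return None  # outcome not specified in the passage above

# Hypothetical keywords and probabilities.
keywords = {"recruitment": {"resume"}, "training": {"course"}, "social": {"friend"}}
probs = {"recruitment": 0.040, "training": 0.038, "social": 0.010}
label = classify(probs, ["course", "teacher"], keywords, ratio_threshold=0.1)
```

In this example the most probable category (recruitment) has no keyword among the feature words, the ratio (0.040 - 0.038) / 0.040 is about 0.05 and therefore below the threshold, and "course" is a keyword of the second-ranked category, so the mail is classified as training mail.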
According to another aspect of the present invention, a mail classification device is also provided, including:
A probability calculation module, configured to calculate, for each predetermined mail category and according to the feature words of the mail to be classified, the probability that the mail to be classified belongs to the mail category, and to take the calculated probability as the probability corresponding to the mail category;
A sorting module, configured to sort the calculated probabilities corresponding to the mail categories to obtain a sorting result;
A category division module, configured to judge whether the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the maximum probability in the sorting result; if so, to classify the mail to be classified into the mail category corresponding to the maximum probability; otherwise, to calculate the difference between the maximum probability and the second-ranked probability in the sorting result, and then the ratio of the difference to the maximum probability; and, if the calculated ratio is less than the set ratio threshold and the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the second-ranked probability, to classify the mail to be classified into the mail category corresponding to the second-ranked probability.
Further, the mail classification device also includes:
A feature word occurrence ratio pre-judgment module, configured to determine, for each predetermined mail category, the number of feature words of the mail to be classified that are contained in the feature lexicon of the mail category, to calculate the ratio of that number to the total number of feature words of the mail to be classified as the feature word occurrence ratio of the mail to be classified under the mail category, and to trigger the probability calculation module upon confirming that this occurrence ratio is greater than the set ratio threshold.
Preferably, the category division module is further configured to: determine the mail to be classified to be dialogue mail if the ratio of the difference to the maximum probability is not less than the set ratio threshold; and, if the ratio of the difference to the maximum probability is less than the set ratio threshold and the feature words of the mail to be classified do not include a keyword of the mail category corresponding to the second-ranked probability, take the ratio of the difference to the maximum probability as a first category probability ratio, further calculate the difference between the maximum probability and the third-ranked probability in the sorting result, take the ratio of this difference to the maximum probability as a second category probability ratio, and, where the second category probability ratio is less than the set ratio threshold and the feature words of the mail to be classified include at least one keyword of the mail category corresponding to the third-ranked probability, classify the mail to be classified into the mail category corresponding to the third-ranked probability.
In the technical solution of the present invention, keywords are set for each mail category, and mail is classified by combining the probability that the mail to be classified belongs to each mail category with the keywords of the mail categories. This avoids the effect that non-keywords in the mail to be classified have on classification accuracy; and, based on the calculated category probability ratio, mail classification remains accurate even when the mail to be classified is not placed in the mail category corresponding to the maximum probability.
Further, calculating the feature word occurrence ratio of the mail to be classified under each mail category simplifies the computation in the classification process while preserving classification accuracy; and classifying separately according to the mail title and the mail content of the mail to be classified further ensures the accuracy of mail classification.
Description of the drawings
Fig. 1 is a flowchart of the method for determining the mail categories and their mail data sample sets and feature lexicons according to an embodiment of the present invention;
Figs. 2a and 2b are flowcharts of the mail classification method according to an embodiment of the present invention;
Fig. 3 is a block diagram of the internal structure of the mail classification device according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to the drawings and preferred embodiments. It should be noted, however, that many details listed in the specification are provided only to give the reader a thorough understanding of one or more aspects of the present invention, and these aspects can be practiced even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable program, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device itself may be a module. One or more modules may reside within a process and/or thread of execution, and a module may be located on one computer and/or distributed between two or more computers.
The inventors of the present invention found that the prior-art method misclassifies mail for the following reason: when the mail content of a mail contains many features of a mail category that are not representative of it, the calculated probability that the mail belongs to that mail category may be the largest, yet placing the mail in that category may be inaccurate. For example, in dialogue mail between two friends who ask about each other's work, the mail content may contain words such as benefits, pay, and position; these words may be features of recruitment mail, so the prior-art method may wrongly place the mail in the recruitment category.
In view of this, a classification rule can be set in advance for each mail category, that is, some more representative words are set as the keywords of the mail category. For example, one or more of the words "work", "resume", "recruitment", and the like are set as keywords of recruitment mail. Then, after the probabilities that the mail to be classified belongs to each mail category are obtained and the mail category corresponding to the maximum probability is determined, it is first judged whether the feature words of the mail to be classified include a keyword of that mail category. If not, the mail to be classified does not satisfy the classification rule of that mail category, and whether to place the mail in the mail category corresponding to the second-ranked probability can be decided according to the difference between the two top-ranked probabilities (expressed here as the category probability ratio) and the keywords of the mail category corresponding to the second-ranked probability. Based on the keywords of the mail categories and the category probability ratio, the mail to be classified can thus be classified more accurately.
The technical solution of the present invention is described in detail below with reference to the drawings. In the embodiment of the present invention, before mail is classified, several mail categories (such as business news, social, bank card, recruitment information, order information, registration information, and news) and the mail data sample set and feature lexicon of each mail category can be predetermined, so that the mail to be classified is classified on the basis of the predetermined mail categories. Specifically, the flow of the method for predetermining the mail categories and the mail data sample set and feature lexicon of each mail category is shown in Fig. 1 and includes the following steps:
S101: For each sample mail in the mail set to be trained, obtain the word set of the sample mail; determine the word set of the mail set to be trained from the obtained word sets of the sample mails, and then determine the word feature vector of the sample mail.
Specifically, non-dialogue sample mails received within a set time period, or a set number of them, can be extracted from the mailboxes of a mail server, and these sample mails form the mail set to be trained. For each sample mail in the mail set to be trained, word segmentation is performed on the mail data of the sample mail (including the mail title and the mail content), and stop words and rare words are removed from the resulting words to obtain the word set of the sample mail. The word sets of the sample mails in the mail set to be trained are merged into one word set, that is, words that are redundant because they repeat across the word sets of the sample mails are removed, to obtain the word set of the mail set to be trained.
For each sample mail in the mail set to be trained, the total number of words in the word set of the mail set to be trained is taken as the dimension of the word feature vector of the sample mail, and each word in the word set of the mail set to be trained corresponds to one vector element of the word feature vector of the sample mail. The value of each vector element is determined as follows: if the word in the word set of the mail set to be trained that corresponds to the vector element is contained in the word set of the sample mail, the vector element is set to 1; otherwise it is set to 0. For example, the word feature vector of a sample mail in the mail set to be trained is expressed as $D = [d_1, \ldots, d_j, \ldots, d_L]$, where $d_j$ is 1 or 0: 1 indicates that the j-th word in the word set of the mail set to be trained is contained in the word set of the current sample mail, and 0 indicates that it is not. Here j is a natural number from 1 to L, and L is the total number of words in the word set of the mail set to be trained.
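The construction of the binary word feature vectors can be sketched in Python as follows. This is a minimal illustration: the function names and sample word sets are hypothetical, and sorting the merged word set is merely one assumed way to fix which word corresponds to which vector element.

```python
def build_vocabulary(sample_word_sets):
    """Merge the word sets of all sample mails into one duplicate-free, ordered word list."""
    return sorted(set().union(*sample_word_sets))

def word_feature_vector(mail_words, vocabulary):
    """d_j is 1 if the j-th vocabulary word occurs in the sample mail's word set, else 0."""
    return [1 if word in mail_words else 0 for word in vocabulary]

# Hypothetical word sets of three sample mails.
samples = [{"resume", "position"}, {"discount", "order"}, {"resume", "order"}]
vocabulary = build_vocabulary(samples)
vectors = [word_feature_vector(words, vocabulary) for words in samples]
```

With these three sample mails the merged word set has L = 4 words, so every sample mail is represented by a four-dimensional 0/1 vector.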
S102: According to the similarity between the word feature vectors of the sample mails in the mail set to be trained, cluster the sample mails in the mail set to be trained with a clustering algorithm to obtain several clusters.
Specifically, the cosine similarity method can generally be used to calculate the similarity between the word feature vectors of any two sample mails, that is, the similarity between the two sample mails. For example, if the word feature vectors of sample mail x and sample mail y are $X = [x_1, \ldots, x_j, \ldots, x_L]$ and $Y = [y_1, \ldots, y_j, \ldots, y_L]$ respectively, the similarity $\mathrm{Sim}(X, Y)$ between the feature vectors of sample mail x and sample mail y can be calculated according to the following formula 2:
$\mathrm{Sim}(X, Y) = \dfrac{\sum_{j=1}^{L} x_j y_j}{\sqrt{\sum_{j=1}^{L} x_j^2}\,\sqrt{\sum_{j=1}^{L} y_j^2}}$ (formula 2)
In this step, a similarity matrix can thus be built from the similarities between the word feature vectors of the sample mails in the mail set to be trained, and the sample mails are clustered with a clustering algorithm (such as a hierarchical clustering algorithm) to obtain clusters that satisfy a preset clustering termination condition. For example, the termination condition can be that the maximum similarity between clusters reaches a set similarity threshold, or that the number of sample mails in a cluster reaches a set value. Building a similarity matrix and clustering with a clustering algorithm are well known to those skilled in the art and are not described further here.
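For illustration, the cosine similarity between word feature vectors and the resulting similarity matrix can be computed as in the following Python sketch. The function names and the example vectors are hypothetical, and treating an all-zero vector as having similarity 0 is an assumption the text does not address. The clustering step itself is omitted here and would typically be delegated to an existing library.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a mail with no vocabulary words is similar to nothing
    return dot / (norm_x * norm_y)

def similarity_matrix(vectors):
    """Pairwise cosine similarities between all word feature vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Hypothetical binary word feature vectors of three sample mails.
vectors = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
matrix = similarity_matrix(vectors)
```

The first two mails share no words, so their similarity is 0, while the first and third share one of their two words each, giving a similarity of about 0.5.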
S103: For each cluster obtained by clustering, place the sample mails contained in the cluster into the same mail category, and let the sample mails of each mail category form the mail data sample set of that mail category.
S104: For the mail data sample set of each mail category, extract the feature words of the sample mails in the mail data sample set of the mail category to obtain the feature lexicon of the mail category.
In this step, for the mail data sample set of each mail category obtained in step S103, the feature words of the sample mails in the mail data sample set are extracted as follows: word segmentation is performed on each sample mail in the mail data sample set of the mail category, and the number of times each resulting word appears in the mail data sample set is counted as the word frequency of that word; after rare words and stop words are removed from the resulting words, the words whose word frequency is greater than a set lower threshold and less than a set upper threshold are determined as candidate words of the mail category; the candidate words whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table are determined as the feature words of the mail category, and the feature words of the mail category form its feature lexicon. Here, performing word segmentation on a sample mail means segmenting both the mail title and the mail content of the sample mail. The part-of-speech information table records the parts of speech selected to improve classification accuracy, such as adverbs, functional idioms, nouns, adjectives, verbs, time words, place morphemes, and measure words, while person names, auxiliary words, and the like are filtered out.
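The feature lexicon construction of step S104 can be sketched in Python as follows. The sketch assumes that a word segmenter has already produced (word, part-of-speech) pairs; the function name, the tag names such as "n" and "v", the thresholds, and the sample data are hypothetical.

```python
from collections import Counter

def build_feature_lexicon(tagged_mails, stop_words, rare_words,
                          min_freq, max_freq, allowed_pos):
    """tagged_mails: list of sample mails, each a list of (word, part_of_speech) pairs."""
    word_freq = Counter()
    word_pos = {}
    for mail in tagged_mails:
        for word, pos in mail:
            word_freq[word] += 1   # occurrences across the whole sample set
            word_pos[word] = pos   # simplification: keep the last tag seen for the word
    return {
        word
        for word, freq in word_freq.items()
        if word not in stop_words and word not in rare_words
        and min_freq < freq < max_freq     # strictly between the two thresholds
        and word_pos[word] in allowed_pos  # part of speech recorded in the table
    }

# Hypothetical tagged sample mails of one category.
mails = [
    [("resume", "n"), ("the", "u"), ("resume", "n")],
    [("resume", "n"), ("quickly", "d"), ("quickly", "d")],
    [("the", "u"), ("the", "u")],
]
feature_lexicon = build_feature_lexicon(
    mails, stop_words={"the"}, rare_words=set(),
    min_freq=1, max_freq=10, allowed_pos={"n", "v"},
)
```

In this example "the" is dropped as a stop word and "quickly" is dropped because its tag is not in the allowed set, which leaves "resume" as the only feature word.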
Based on the predetermined mail categories, the flow of the mail classification method provided by the embodiment of the present invention is shown in Figs. 2a and 2b and includes the following steps:
S201: For the mail to be classified, extract the feature words from the mail data of the mail to be classified; let i = 1.
Specifically, an existing word segmentation algorithm can be used to segment the mail to be classified, and after rare words and stop words are removed from the resulting words, the feature words of the mail to be classified are obtained.
To simplify computation while improving classification accuracy, after step S201, in steps S202 to S205, the feature word occurrence ratio of the mail to be classified under each predetermined mail category can be calculated, and the probability that the mail to be classified belongs to a mail category is calculated only after its feature word occurrence ratio under that mail category is determined to be greater than the set ratio threshold.
S202: For the predetermined i-th mail category, calculate the feature word occurrence ratio of the mail to be classified under the i-th mail category.
Specifically, for the predetermined i-th mail category, the number of feature words of the mail to be classified that are contained in the feature lexicon of the i-th mail category is determined, the ratio of that number to the total number of feature words of the mail to be classified is calculated, and the calculated ratio is taken as the feature word occurrence ratio of the mail to be classified under the i-th mail category. The feature word occurrence ratio under a mail category reflects how many of the feature words of the mail to be classified appear in the feature lexicon of that mail category, and therefore reflects the likelihood that the mail belongs to that category: if the ratio is small, the probability that the mail belongs to the mail category is small; otherwise, it is large. Here i is a natural number from 1 to m, and m is the number of predetermined mail categories.
S203: Compare the feature word occurrence ratio of the mail to be classified under the i-th mail category with the set ratio threshold, and judge whether the occurrence ratio is greater than the set ratio threshold; if so, perform step S204; otherwise, jump to step S205.
That is, it is judged whether the feature word occurrence ratio of the mail to be classified under the i-th mail category is greater than the set ratio threshold. If so, the probability that the mail to be classified belongs to the i-th mail category is calculated; otherwise, the method directly judges whether i equals m without calculating that probability, which simplifies computation while preserving classification accuracy.
S204: According to the feature words of the mail to be sorted, calculate the probability that the mail to be sorted belongs to the i-th mail class, and take the calculated probability as the probability corresponding to the i-th mail class.
That is, if the comparison shows that the feature-word occurrence ratio of the mail to be sorted under the i-th mail class is greater than the set ratio threshold, the probability that the mail to be sorted belongs to that mail class is calculated.
Specifically, the calculation can be based on the existing naive Bayes algorithm, which assumes that the feature words of the mail to be sorted are mutually independent. Denote the i-th mail class as Ci and the n feature words of the mail to be sorted as F1, F2, ..., Fn. By Bayes' theorem, the probability P(Ci|F1,F2,...,Fn) that the mail to be sorted belongs to the i-th mail class Ci can be expressed as formula 3:
P(Ci|F1,F2,...,Fn) = P(Ci)P(F1,F2,...,Fn|Ci) / P(F1,F2,...,Fn) (formula 3)
Since the feature words of the mail to be sorted are mutually independent:
P(F1,F2,...,Fn|Ci) = P(F1|Ci)P(F2|Ci)...P(Fn|Ci);
and:
P(F1,F2,...,Fn) = P(F1)P(F2)...P(Fn);
P(F1,F2,...,Fn) is the same for every mail class, therefore:
P(Ci|F1,F2,...,Fn) ∝ P(Ci)P(F1|Ci)P(F2|Ci)...P(Fn|Ci);
Calculating P(Ci|F1,F2,...,Fn) is thus reduced to calculating P(Ci) and P(Fk|Ci). The value of the following formula 1 can therefore be calculated and taken as the probability that the mail to be sorted belongs to the i-th mail class:
P(Ci)P(F1|Ci)P(F2|Ci)...P(Fn|Ci) (formula 1)
In formula 1, P(Fk|Ci) and P(Ci) are determined from the following quantities, where k is a natural number from 1 to n:
the number of times feature word Fk occurs in the mail data sample set of mail class Ci;
the sum of the numbers of times that the feature words in the feature lexicon of mail class Ci occur in the mail data sample set of mail class Ci;
the number of sample mails in the mail data sample set of mail class Ci;
S, the total number of sample mails in the mail data sample sets of all mail classes.
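A minimal sketch of evaluating formula 1 follows. It assumes the plain ratio estimates suggested by the quantities listed above: P(Fk|Ci) as the occurrence count of Fk divided by the summed occurrence count of all lexicon words of Ci, and P(Ci) as the number of sample mails of Ci divided by S. These estimates, the function name, and the sample counts are assumptions; in particular, no smoothing is applied, so a feature word never seen in class Ci drives the score to zero.

```python
def formula_1_score(mail_feature_words, word_counts, class_sample_count, total_sample_count):
    """Evaluate P(Ci) * P(F1|Ci) * ... * P(Fn|Ci) under assumed ratio estimates.

    word_counts maps each lexicon word of class Ci to its number of
    occurrences in the mail data sample set of Ci.
    """
    total_occurrences = sum(word_counts.values())
    if total_occurrences == 0 or total_sample_count == 0:
        return 0.0
    score = class_sample_count / total_sample_count  # assumed estimate of P(Ci)
    for word in mail_feature_words:
        # assumed estimate of P(Fk|Ci); unseen words contribute a factor of 0
        score *= word_counts.get(word, 0) / total_occurrences
    return score


# Hypothetical counts for a recruitment class holding 20 of 100 sample mails.
counts = {"resume": 6, "recruitment": 3, "salary": 1}
score = formula_1_score(["resume", "recruitment"], counts, 20, 100)
print(score)  # approximately 0.036 = 0.2 * 0.6 * 0.3
```

For mails with many feature words the product becomes very small, so a practical variant would sum logarithms instead; the direct product is kept here to mirror formula 1.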
Moreover, using the feature-word occurrence ratio of the mail to be sorted as a judgment factor means that, when the probability that the mail to be sorted belongs to the i-th mail class Ci is calculated with the above naive Bayes algorithm and the mail is then classified, the class decision is not distorted by a single feature word of the mail that occurs very frequently in the mail data sample set of mail class Ci. For example, if feature word F1 occurs a very large number of times in the mail data sample set of mail class Ci while the other feature words hardly occur in that sample set, P(F1|Ci) may be so large that P(Ci)P(F1|Ci)P(F2|Ci)...P(Fn|Ci) becomes large, which makes the classification of the mail to be sorted insufficiently accurate; using the feature-word occurrence ratio as a judgment factor avoids this situation.
In addition, after the word feature vector of the mail to be sorted is obtained, each element of the word feature vector may be normalized, and the similarity between the feature vector of the mail to be sorted and the feature vector of each sample mail in the i-th mail class Ci may be calculated; the mean of these similarities is then calculated and taken as the probability that the mail to be sorted belongs to the i-th mail class.
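The similarity-based alternative can be sketched as below. The text names neither the normalization nor the similarity measure, so this sketch assumes L2 normalization and cosine similarity; both choices, and all names, are illustrative assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumed L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def mean_similarity(mail_vec, sample_vecs):
    """Mean similarity between the mail and every sample mail of one class."""
    if not sample_vecs:
        return 0.0
    q = normalize(mail_vec)
    # The dot product of unit vectors is the cosine similarity (assumed measure).
    sims = [sum(a * b for a, b in zip(q, normalize(s))) for s in sample_vecs]
    return sum(sims) / len(sims)


# Hypothetical two-dimensional word feature vectors.
print(mean_similarity([1, 0], [[1, 0], [0, 1]]))  # 0.5
```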
S205: Judge whether i is equal to m; if so, execute step S206; otherwise, set i=i+1 and jump to step S202.
Specifically, m is the number of predetermined mail classes. If i=m, the feature-word occurrence ratio of the mail to be sorted has been calculated under every mail class. If i≠m, i is set to i+1 and the method jumps to step S202 to calculate the feature-word occurrence ratio of the mail to be sorted under the next, (i+1)-th, mail class.
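The loop formed by steps S202 to S205 can be sketched as follows: each class is visited once, and the probability is computed only for classes whose feature-word occurrence ratio exceeds the set ratio threshold. The function and parameter names are hypothetical, and `score_fn` stands in for whichever probability calculation is used in step S204.

```python
def gated_probabilities(mail_words, lexicons, score_fn, ratio_threshold):
    """Return {class name: probability} for the classes that pass the ratio gate."""
    probs = {}
    for name, lexicon in lexicons.items():          # i = 1 .. m
        hits = sum(1 for w in mail_words if w in lexicon)
        ratio = hits / len(mail_words) if mail_words else 0.0   # S202
        if ratio > ratio_threshold:                 # S203
            probs[name] = score_fn(name, mail_words)  # S204
        # otherwise the probability calculation is skipped for this class
    return probs


# Hypothetical lexicons and a placeholder scoring function.
lexicons = {"recruitment": {"resume", "job"}, "travel": {"flight"}}
result = gated_probabilities(
    ["resume", "job", "flight"], lexicons, lambda name, words: 0.1, 0.5
)
print(result)  # {'recruitment': 0.1}
```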
S206: After sorting the calculated probabilities corresponding to the mail classes, judge whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the maximum probability; if so, execute step S207; otherwise, execute step S210.
Specifically, a keyword table is stored in advance for each mail class, and the keywords in the keyword table of a mail class are generally predetermined as follows: for each mail class, and for each feature word in the feature lexicon of that mail class, the number of sample mails of the mail class that contain the feature word is counted in advance, and the feature words are sorted in descending order of this number; a set number of the top-ranked feature words are taken as the keywords of the mail class. Alternatively, the keywords of each mail class can be set by those skilled in the art according to experience. For example, "work", "resume", "recruitment" and the like are set as the keywords of recruitment mail.
In this step, the calculated probabilities corresponding to the mail classes are sorted in descending order, and it is judged whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the maximum probability, that is, whether the mail to be sorted satisfies the classification rule of that mail class. If the feature words of the mail to be sorted include one or more keywords of the mail class corresponding to the maximum probability, the mail satisfies the classification rule of that mail class and can be divided into it directly. If the feature words of the mail to be sorted include none of the keywords of the mail class corresponding to the maximum probability, dividing the mail into that mail class would not be accurate enough, and the mail is processed according to the following steps S210 to S216.
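The automatic construction of a keyword table described for step S206 can be sketched as follows. Representing each sample mail as a set of words, the function name, and the alphabetical tie-break are assumptions made for illustration.

```python
def select_keywords(lexicon, sample_mails, top_k):
    """Pick the top_k lexicon words contained in the most sample mails of a class."""
    # Number of sample mails of the class that contain each feature word.
    mail_counts = {w: sum(1 for mail in sample_mails if w in mail) for w in lexicon}
    # Descending by count; ties broken alphabetically (an assumption).
    ranked = sorted(lexicon, key=lambda w: (-mail_counts[w], w))
    return ranked[:top_k]


# Hypothetical sample mails of a recruitment class, each given as a set of words.
samples = [{"resume", "job"}, {"resume"}, {"resume", "salary", "job"}]
keywords = select_keywords({"resume", "job", "salary"}, samples, 2)
print(keywords)  # ['resume', 'job']
```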
S207: Divide the mail to be sorted into the mail class corresponding to the maximum probability.
If it is judged that the feature words of the mail to be sorted include one or more keywords of the mail class corresponding to the maximum probability, the mail to be sorted is divided into the mail class corresponding to the maximum probability.
S210: Set h=2, where h is a natural number greater than or equal to 2 and less than or equal to m.
S211: Calculate the difference between the maximum probability and the probability ranked h-th, then calculate the ratio of this difference to the maximum probability, and take the ratio as the (h-1)-th class probability difference rate.
For example, when h=2, the difference between the maximum probability and the second-ranked probability is calculated, and the ratio of this difference to the maximum probability is taken as the first class probability difference rate; that is, if the maximum probability is P1 and the second-ranked probability is P2, the first class probability difference rate of the mail to be sorted is Pd1=(P1-P2)/P1.
As another example, if the third-ranked probability is P3, the second class probability difference rate is Pd2=(P1-P3)/P1.
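The calculation of step S211 can be illustrated with a short sketch. The function name and the probability values are hypothetical; the formula is the one given above, (P1-Ph)/P1.

```python
def difference_rate(sorted_probs, h):
    """(h-1)-th difference rate: (P1 - Ph) / P1, with probabilities sorted descending."""
    p_max = sorted_probs[0]
    return (p_max - sorted_probs[h - 1]) / p_max


# Hypothetical probabilities, sorted in descending order: P1, P2, P3.
probs = sorted([0.125, 0.5, 0.25], reverse=True)
pd1 = difference_rate(probs, 2)
pd2 = difference_rate(probs, 3)
print(pd1, pd2)  # 0.5 0.75
```

A small rate means the h-th ranked class is almost as probable as the top-ranked class, which is why only classes below the threshold are considered as alternatives.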
S212: Judge whether the (h-1)-th class probability difference rate is less than the set difference-rate threshold; if so, execute step S213; otherwise, execute step S216.
The set difference-rate threshold can be set by those skilled in the art according to actual mail classification conditions.
S213: Judge whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the probability ranked h-th; if so, execute step S214; otherwise, execute step S215.
For example, when h=2, if the first class probability difference rate is less than the set difference-rate threshold, it is judged whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the second-ranked probability.
S214: Divide the mail to be sorted into the mail class corresponding to the probability ranked h-th.
If the (h-1)-th class probability difference rate of the mail to be sorted is judged in step S212 to be less than the set difference-rate threshold, and the feature words of the mail are judged in step S213 to include at least one keyword of the mail class corresponding to the probability ranked h-th, the mail to be sorted is divided into that mail class.
For example, when h=2, if the first class probability difference rate is judged to be less than the set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the second-ranked probability, the mail to be sorted is divided into the mail class corresponding to the second-ranked probability.
As another example, when h=3, if the second class probability difference rate is determined to be less than the set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the third-ranked probability, the mail to be sorted is divided into the mail class corresponding to the third-ranked probability.
S215: Judge whether h is equal to m; if so, execute step S216; otherwise, set h=h+1 and jump to step S211.
Specifically, if the (h-1)-th class probability difference rate is less than the set difference-rate threshold but the feature words of the mail to be sorted do not include any keyword of the mail class corresponding to the probability ranked h-th, the difference between the maximum probability and the probability ranked (h+1)-th is further calculated, the ratio of this difference to the maximum probability is taken as the h-th class probability difference rate, and the mail is classified according to the h-th class probability difference rate.
S216: Determine the mail to be sorted to be a dialogue mail.
If the (h-1)-th class probability difference rate of the mail to be sorted is judged in step S212 to be not less than (that is, greater than or equal to) the set difference-rate threshold, the mail to be sorted is determined in this step to be an ordinary dialogue mail. For example, when h=2, if the first class probability difference rate is not less than the set difference-rate threshold, the mail to be sorted is determined to be a dialogue mail.
Alternatively, after h=m is judged in step S215, the mail to be sorted is determined in this step to be an ordinary dialogue mail.
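The decision flow of steps S206 to S216 can be sketched as a single function. This is an illustrative reading of the steps, not a definitive implementation: the names are hypothetical, the loop runs over the classes for which a probability was actually calculated, and a mail with no usable probability is treated as a dialogue mail, which is an assumption.

```python
def classify(probs, mail_words, keywords, diff_threshold, fallback="dialogue"):
    """probs: {class: probability}; keywords: {class: set of keywords}."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return fallback  # assumption: nothing to rank, treat as dialogue mail
    words = set(mail_words)
    top_class, p_max = ranked[0]
    if words & set(keywords.get(top_class, ())):        # S206 -> S207
        return top_class
    for cls, p in ranked[1:]:                           # h = 2 .. m (S210, S215)
        if (p_max - p) / p_max >= diff_threshold:       # S211, S212 -> S216
            return fallback
        if words & set(keywords.get(cls, ())):          # S213 -> S214
            return cls
    return fallback                                     # h reached m -> S216


# Hypothetical probabilities and keyword tables.
probs = {"recruitment": 0.5, "travel": 0.45, "bills": 0.1}
keywords = {"recruitment": {"resume"}, "travel": {"flight"}, "bills": {"invoice"}}
print(classify(probs, ["flight", "hotel"], keywords, 0.2))  # travel
print(classify(probs, ["hello"], keywords, 0.2))            # dialogue
```

In the first call the top class fails the keyword check, the second-ranked class is within the difference-rate threshold and passes the keyword check, so it is chosen. In the second call no keyword matches and the third-ranked class is too far from the maximum, so the mail falls back to a dialogue mail.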
Preferably, the present invention can also perform word segmentation separately on the mail title and the mail content of the mail to be sorted, extracting title feature words from the mail title and content feature words from the mail content; in other words, the feature words of the mail to be sorted specifically include title feature words and content feature words. After the title feature words and content feature words are extracted, for each mail class, the probability that the mail title belongs to the mail class can be calculated according to the title feature words and taken as the title probability corresponding to that mail class, and the probability that the mail content belongs to the mail class can be calculated according to the content feature words and taken as the content probability corresponding to that mail class. The calculated title probabilities corresponding to the mail classes are then sorted; if the title feature words of the mail to be sorted are judged to include at least one keyword of the mail class corresponding to the maximum title probability, that mail class is taken as the to-be-determined mail class corresponding to the mail title. Likewise, the calculated content probabilities corresponding to the mail classes are sorted; if the content feature words of the mail to be sorted are judged to include a keyword of the mail class corresponding to the maximum content probability, that mail class is taken as the to-be-determined mail class corresponding to the mail content. If the to-be-determined mail class corresponding to the mail title is the same as the to-be-determined mail class corresponding to the mail content, the mail to be sorted is divided into that mail class; otherwise, the mail to be sorted is classified as a dialogue mail. Thus, if the error probability of mail classification is Pe, the error probability of the method that makes separate classification judgments based on the mail title and the mail content is Pe², so that the classification error rate is reduced, that is, the accuracy of mail classification is improved.
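The title and content agreement rule can be sketched as follows. The function names and data are hypothetical, and the keyword check uses the same keyword table for title and content, which is an assumption. Note also that the Pe² figure corresponds to the case where the title judgment and the content judgment err independently of each other.

```python
def top_class_with_keyword(probs, words, keywords):
    """Class with the maximum probability, or None if its keywords are not matched."""
    if not probs:
        return None
    best = max(probs, key=probs.get)
    return best if set(words) & set(keywords.get(best, ())) else None


def classify_title_and_content(title_probs, title_words, content_probs,
                               content_words, keywords, fallback="dialogue"):
    """Assign a class only when the title and content judgments agree."""
    t = top_class_with_keyword(title_probs, title_words, keywords)
    c = top_class_with_keyword(content_probs, content_words, keywords)
    return t if t is not None and t == c else fallback


# Hypothetical probabilities and keyword tables.
keywords = {"recruitment": {"resume", "job"}, "travel": {"flight"}}
agree = classify_title_and_content(
    {"recruitment": 0.6, "travel": 0.1}, ["job"],
    {"recruitment": 0.4, "travel": 0.2}, ["resume", "salary"], keywords)
differ = classify_title_and_content(
    {"recruitment": 0.6, "travel": 0.1}, ["job"],
    {"recruitment": 0.1, "travel": 0.5}, ["flight"], keywords)
print(agree, differ)  # recruitment dialogue
```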
Preferably, since a given sender usually sends sample mails of one or more particular mail classes, the present invention can also record the senders of the sample mails of each mail class. When a mail to be sorted is received, the mail classes of the sample mails previously sent by its sender are determined according to the sender of the mail to be sorted, the probabilities that the mail to be sorted belongs to these mail classes are calculated directly, the probability that is greater than a set probability threshold and is the largest is determined, and the mail to be sorted is divided into the mail class corresponding to that probability. Mail classification is thus carried out based on the sender, and the calculation can be simplified.
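The sender-based shortcut can be sketched as follows. All names are hypothetical, `score_fn` stands in for the probability calculation, and returning `None` when no class qualifies (so that the caller can fall back to the full procedure) is an assumption that the text does not spell out.

```python
def classify_by_sender(sender, sender_classes, score_fn, mail_words, prob_threshold):
    """Score only the classes this sender has sent before; return the best one or None."""
    candidates = sender_classes.get(sender, ())
    probs = {c: score_fn(c, mail_words) for c in candidates}
    # Keep only probabilities above the set probability threshold.
    eligible = {c: p for c, p in probs.items() if p > prob_threshold}
    return max(eligible, key=eligible.get) if eligible else None


# Hypothetical sender history and placeholder scores.
sender_classes = {"hr@example.com": ["recruitment", "notice"]}
scores = {"recruitment": 0.3, "notice": 0.05}
chosen = classify_by_sender(
    "hr@example.com", sender_classes, lambda c, words: scores[c], ["resume"], 0.1
)
print(chosen)  # recruitment
```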
Based on the above mail classification method, the internal structure block diagram of the mail classification device of the embodiment of the present invention is shown in Fig. 3. The device specifically includes: a probability calculation module 301, a class division module 302 and a sorting module 304.
The probability calculation module 301 is used to calculate, for each predetermined mail class and according to the feature words of the mail to be sorted, the probability that the mail to be sorted belongs to the mail class, and to take the calculated probability as the probability corresponding to the mail class.
The sorting module 304 is used to sort the calculated probabilities corresponding to the mail classes to obtain a sorting result.
The class division module 302 is used to judge whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the maximum probability in the sorting result; if so, the mail to be sorted is divided into the mail class corresponding to the maximum probability; otherwise, the difference between the maximum probability and the second-ranked probability in the sorting result is calculated, and the ratio of this difference to the maximum probability is calculated; if the calculated ratio is judged to be less than the set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the second-ranked probability, the mail to be sorted is divided into the mail class corresponding to the second-ranked probability.
Further, the class division module 302 is also used to determine the mail to be sorted to be a dialogue mail if the calculated ratio is judged to be not less than the set difference-rate threshold. If the calculated ratio is judged to be less than the set difference-rate threshold but the feature words of the mail to be sorted do not include any keyword of the mail class corresponding to the second-ranked probability, then: the calculated ratio is taken as the first class probability difference rate, the difference between the maximum probability and the third-ranked probability in the sorting result is further calculated, and the ratio of this difference to the maximum probability is taken as the second class probability difference rate; when the second class probability difference rate is determined to be less than the set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the third-ranked probability, the mail to be sorted is divided into the mail class corresponding to the third-ranked probability.
Further, the above mail classification device may also include a feature-word occurrence ratio pre-judgment module 303.
The feature-word occurrence ratio pre-judgment module 303 is used to determine, for each predetermined mail class, the number of feature words of the mail to be sorted that are contained in the feature lexicon of the mail class, and to calculate the ratio of the determined number to the total number of feature words of the mail to be sorted as the feature-word occurrence ratio of the mail to be sorted under the mail class; and, when confirming that the feature-word occurrence ratio of the mail to be sorted under the mail class is greater than the set ratio threshold, to trigger the probability calculation module 301. Correspondingly, the probability calculation module 301 calculates, according to the feature words of the mail to be sorted, the probability that the mail to be sorted belongs to the mail class, and takes the calculated probability as the probability corresponding to the mail class.
For the functions realized by each module of the mail classification device, refer to the description of the steps of the mail classification method shown in Figs. 2a and 2b above.
In the technical scheme of the present invention, keywords are set for each mail class, and mail classification is carried out by combining the probability that the mail to be sorted belongs to each mail class with the keywords of the mail classes. This avoids the influence of non-key words in the mail to be sorted on the accuracy of mail classification. In addition, the class probability difference rate calculated for the mail to be sorted ensures that mail classification remains highly accurate in cases where the mail class corresponding to the maximum probability does not pass the keyword check.
Further, calculating the feature-word occurrence ratio of the mail to be sorted under each mail class can simplify the calculation in the mail classification process while maintaining the accuracy of mail classification; and carrying out mail classification separately according to the mail title and the mail content of the mail to be sorted can further ensure the accuracy of mail classification.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A mail classification method, characterized by comprising:
for each predetermined mail class, calculating, according to the feature words of a mail to be sorted, the probability that the mail to be sorted belongs to the mail class, and taking the calculated probability as the probability corresponding to the mail class;
sorting the calculated probabilities corresponding to the mail classes, and judging whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the maximum probability; if so, dividing the mail to be sorted into the mail class corresponding to the maximum probability; otherwise:
calculating the difference between the maximum probability and the second-ranked probability, and calculating the ratio of the difference to the maximum probability; if the calculated ratio is judged to be less than a set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the second-ranked probability, dividing the mail to be sorted into the mail class corresponding to the second-ranked probability.
2. The method of claim 1, characterized in that, before calculating the probability that the mail to be sorted belongs to the mail class, the method further comprises:
determining the number of feature words of the mail to be sorted that are contained in the feature lexicon of the mail class, and calculating the ratio of the determined number to the total number of feature words of the mail to be sorted as the feature-word occurrence ratio of the mail to be sorted under the mail class; and confirming that the feature-word occurrence ratio of the mail to be sorted under the mail class is greater than a set ratio threshold.
3. The method of claim 2, characterized in that the keywords of the mail classes are predetermined as follows:
for each mail class, and for each feature word in the feature lexicon of the mail class, counting in advance the number of sample mails of the mail class that contain the feature word, and sorting the feature words in descending order of this number; and taking a set number of the top-ranked feature words as the keywords of the mail class.
4. The method of claim 3, characterized in that calculating, for each predetermined mail class and according to the feature words of the mail to be sorted, the probability that the mail to be sorted belongs to the mail class specifically comprises:
denoting the i-th mail class as Ci and the n feature words of the mail to be sorted as F1, F2, ..., Fn, and calculating the value of the following formula 1 as the probability that the mail to be sorted belongs to the i-th mail class:
P(Ci)P(F1|Ci)P(F2|Ci)...P(Fn|Ci) (formula 1)
In formula 1, P(Fk|Ci) and P(Ci) are determined from the following quantities, where k is a natural number from 1 to n: the number of times feature word Fk occurs in the mail data sample set of mail class Ci; the sum of the numbers of times that the feature words in the feature lexicon of mail class Ci occur in the mail data sample set of mail class Ci; the number of sample mails in the mail data sample set of mail class Ci; and S, the total number of sample mails in the mail data sample sets of all mail classes.
5. The method of claim 4, characterized in that the feature lexicon of the mail class is obtained as follows:
for each mail class, performing word segmentation on the sample mails in the mail data sample set of the mail class, and counting the number of times each segmented word occurs in the mail data sample set of the mail class as the word frequency of the word; after removing uncommon words and stop words from the segmented words, determining the words whose word frequency is greater than a set lower threshold and less than a set upper threshold as candidate words of the mail class; and determining the candidate words of the mail class whose part-of-speech information matches the part-of-speech information recorded in a part-of-speech information table as the feature words of the mail class, the feature words of the mail class forming the feature lexicon of the mail class;
wherein the mail data sample set of each mail class is obtained by division based on a clustering algorithm, according to the similarity between the feature vectors of the sample mails.
6. The method of claim 4 or 5, characterized in that the feature words of the mail to be sorted specifically include: title feature words extracted from the mail title of the mail to be sorted, and content feature words extracted from the mail content of the mail to be sorted; and
said calculating, according to the feature words of the mail to be sorted, the probability that the mail to be sorted belongs to the mail class specifically comprises:
calculating, according to the title feature words of the mail to be sorted, the probability that the mail title of the mail to be sorted belongs to the mail class, and taking this probability as the title probability corresponding to the mail class; and
calculating, according to the content feature words of the mail to be sorted, the probability that the mail content of the mail to be sorted belongs to the mail class, and taking this probability as the content probability corresponding to the mail class; and
said sorting the calculated probabilities corresponding to the mail classes, judging whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the maximum probability, and, if so, dividing the mail to be sorted into the mail class corresponding to the maximum probability specifically comprises:
sorting the calculated title probabilities corresponding to the mail classes, and, if the title feature words of the mail to be sorted are judged to include at least one keyword of the mail class corresponding to the maximum title probability, taking the mail class corresponding to the maximum title probability as the to-be-determined mail class corresponding to the mail title; and
sorting the calculated content probabilities corresponding to the mail classes, and, if the content feature words of the mail to be sorted are judged to include a keyword of the mail class corresponding to the maximum content probability, taking the mail class corresponding to the maximum content probability as the to-be-determined mail class corresponding to the mail content;
if the to-be-determined mail class corresponding to the mail title is the same as the to-be-determined mail class corresponding to the mail content, dividing the mail to be sorted into the to-be-determined mail class.
7. The method of any one of claims 1-5, characterized in that, after calculating the difference between the maximum probability and the second-ranked probability and calculating the ratio of the difference to the maximum probability, the method further comprises:
if the ratio of the difference to the maximum probability is judged to be not less than the set difference-rate threshold, determining the mail to be sorted to be a dialogue mail;
if the ratio of the difference to the maximum probability is judged to be less than the set difference-rate threshold, and the feature words of the mail to be sorted do not include any keyword of the mail class corresponding to the second-ranked probability, then:
taking the ratio of the difference to the maximum probability as a first class probability difference rate, further calculating the difference between the maximum probability and the third-ranked probability, and taking the ratio of this difference to the maximum probability as a second class probability difference rate; and, if the second class probability difference rate is determined to be less than the set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the third-ranked probability, dividing the mail to be sorted into the mail class corresponding to the third-ranked probability.
8. A mail classification device, characterized by comprising:
a probability calculation module, used to calculate, for each predetermined mail class and according to the feature words of a mail to be sorted, the probability that the mail to be sorted belongs to the mail class, and to take the calculated probability as the probability corresponding to the mail class;
a sorting module, used to sort the calculated probabilities corresponding to the mail classes to obtain a sorting result;
a class division module, used to judge whether the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the maximum probability in the sorting result; if so, to divide the mail to be sorted into the mail class corresponding to the maximum probability; otherwise, to calculate the difference between the maximum probability and the second-ranked probability in the sorting result and calculate the ratio of the difference to the maximum probability; and, if the calculated ratio is determined to be less than a set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the second-ranked probability, to divide the mail to be sorted into the mail class corresponding to the second-ranked probability.
9. The device of claim 8, characterized by further comprising:
a feature-word occurrence ratio pre-judgment module, used to determine, for each predetermined mail class, the number of feature words of the mail to be sorted that are contained in the feature lexicon of the mail class, and to calculate the ratio of the determined number to the total number of feature words of the mail to be sorted as the feature-word occurrence ratio of the mail to be sorted under the mail class; and, when confirming that the feature-word occurrence ratio of the mail to be sorted under the mail class is greater than a set ratio threshold, to trigger the probability calculation module.
10. The device of claim 8 or 9, characterized in that
the class division module is also used to determine the mail to be sorted to be a dialogue mail if the ratio of the difference to the maximum probability is judged to be not less than the set difference-rate threshold; and, if the ratio of the difference to the maximum probability is judged to be less than the set difference-rate threshold and the feature words of the mail to be sorted do not include any keyword of the mail class corresponding to the second-ranked probability, then: to take the ratio of the difference to the maximum probability as a first class probability difference rate, to further calculate the difference between the maximum probability and the third-ranked probability in the sorting result, and to take the ratio of this difference to the maximum probability as a second class probability difference rate; and, when the second class probability difference rate is determined to be less than the set difference-rate threshold and the feature words of the mail to be sorted include at least one keyword of the mail class corresponding to the third-ranked probability, to divide the mail to be sorted into the mail class corresponding to the third-ranked probability.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410163082.1A CN103984703B (en) | 2014-04-22 | 2014-04-22 | Mail classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410163082.1A CN103984703B (en) | 2014-04-22 | 2014-04-22 | Mail classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103984703A CN103984703A (en) | 2014-08-13 |
CN103984703B true CN103984703B (en) | 2017-04-12 |
Family
ID=51276676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410163082.1A Active CN103984703B (en) | 2014-04-22 | 2014-04-22 | Mail classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103984703B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915453A (en) * | 2015-07-01 | 2015-09-16 | 北京奇虎科技有限公司 | Method, device and system for classifying POI information |
CN104899339A (en) * | 2015-07-01 | 2015-09-09 | 北京奇虎科技有限公司 | Method and system for classifying POI (Point of Interest) information |
CN107528763A (en) * | 2016-06-22 | 2017-12-29 | 北京易讯通信息技术股份有限公司 | Mail content analysis method based on Spark and YARN |
CN106130880A (en) * | 2016-07-05 | 2016-11-16 | 马岩 | Method and system for collecting network mail data |
CN106453033B (en) * | 2016-08-31 | 2019-03-15 | 电子科技大学 | Multi-level mail classification method based on mail content |
CN109615153B (en) * | 2017-09-26 | 2023-06-16 | 阿里巴巴集团控股有限公司 | Merchant risk assessment method, device, equipment and storage medium |
CN107644101B (en) * | 2017-09-30 | 2020-11-13 | 百度在线网络技术(北京)有限公司 | Information classification method and device, information classification equipment and computer readable medium |
CN107657284A (en) * | 2017-10-11 | 2018-02-02 | 宁波爱信诺航天信息有限公司 | Trade name classification method and system based on semantic similarity extension |
CN110750636A (en) * | 2018-07-04 | 2020-02-04 | 百度在线网络技术(北京)有限公司 | Network public opinion information processing method and device |
CN109379228A (en) * | 2018-11-02 | 2019-02-22 | 平安科技(深圳)有限公司 | False alarm information recognition method and device, storage medium, and electronic terminal |
CN111984736B (en) * | 2019-05-21 | 2024-03-29 | 腾讯科技(深圳)有限公司 | Object class detection method, device, readable storage medium and computer equipment |
CN110400123B (en) * | 2019-07-05 | 2023-06-20 | 中国平安财产保险股份有限公司 | Friend-making information popularization method, friend-making information popularization device, friend-making information popularization equipment and friend-making information popularization computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7487544B2 (en) * | 2001-07-30 | 2009-02-03 | The Trustees Of Columbia University In The City Of New York | System and methods for detection of new malicious executables |
CN103136266A (en) * | 2011-12-01 | 2013-06-05 | 中兴通讯股份有限公司 | Method and device for classification of mail |
CN103440242A (en) * | 2013-06-26 | 2013-12-11 | 北京亿赞普网络技术有限公司 | User search behavior-based personalized recommendation method and system |
WO2014036788A1 (en) * | 2012-09-07 | 2014-03-13 | 盈世信息科技(北京)有限公司 | A method for collecting and classification email |
Also Published As
Publication number | Publication date |
---|---|
CN103984703A (en) | 2014-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103984703B (en) | Mail classification method and device | |
CN105787025B (en) | Network platform public account classification method and device | |
Meyer et al. | SpamBayes: Effective open-source, Bayesian based, email classification system. | |
CN107943941B (en) | Junk text recognition method and system capable of being updated iteratively | |
CN103514174B (en) | File classification method and device | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
Pendar | Toward spotting the pedophile telling victim from predator in text chats | |
CN103729474B (en) | Method and system for recognizing forum user vest account | |
WO2017173093A1 (en) | Method and device for identifying spam mail | |
CN110263248A (en) | Information pushing method, device, storage medium and server | |
CN113934941B (en) | User recommendation system and method based on multidimensional information | |
CN109933648B (en) | Real user comment distinguishing method and device | |
WO2017091985A1 (en) | Method and device for recognizing stop word | |
CN1687924A (en) | Method for producing internet personage information search engine | |
CN109446393B (en) | Network community topic classification method and device | |
CN111079029A (en) | Sensitive account detection method, storage medium and computer equipment | |
CN109062895A (en) | Intelligent semantic processing method | |
CN105740232A (en) | Method and device for automatically extracting feedback hotspots | |
CN108462624B (en) | Junk mail identification method and device and electronic equipment | |
CN107704869B (en) | Corpus data sampling method and model training method | |
Gunawan et al. | Filtering spam text messages by using Twitter-LDA algorithm | |
Anitha et al. | Email spam filtering using machine learning based xgboost classifier method | |
WO2024055603A1 (en) | Method and apparatus for identifying text from minor | |
CN108475265B (en) | Method and device for acquiring unknown words | |
CN108073567A (en) | Feature word extraction processing method, system and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2023-04-17. Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193. Patentee after: Sina Technology (China) Co.,Ltd. Address before: 20th floor, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 100080. Patentee before: Sina.com Technology (China) Co.,Ltd. |