CN104142997A - Bayes text classifier based on reverse word frequency - Google Patents

Bayes text classifier based on reverse word frequency

Info

Publication number
CN104142997A
CN104142997A CN201410376416.3A CN201410376416A
Authority
CN
China
Prior art keywords
word
classification
probability
words
bayes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410376416.3A
Other languages
Chinese (zh)
Inventor
关丹辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201410376416.3A priority Critical patent/CN104142997A/en
Publication of CN104142997A publication Critical patent/CN104142997A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 - Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Bayes text classifier based on reverse word frequency. The reverse word frequency of a word serves as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability. According to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from these prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving classification. Compared with the prior art, the Bayes text classifier based on reverse word frequency first assumes that the occurrence probabilities of the words are mutually independent, estimates the prior probability of each word from a training data set, and from these computes the posterior probability that a test document containing the words belongs to each class. The document is assigned to the class with the maximum posterior probability. The classifier is reasonable in design, simple in structure and convenient to use, and therefore has good practical value.

Description

Bayes text classifier based on reverse word frequency
Technical field
The present invention relates to the fields of information science and machine learning, and specifically to a Bayes text classifier based on reverse word frequency.
Background technology
Currently, the arrival of the big data age is gradually being recognized by industry, and big data applications are gradually being put into practice. In the big data age, disciplines such as data analysis, data mining and machine learning are in great demand and have become sharp tools for mining value from data. With the surge of data volume, and especially the marked rise of text data, more and more information accumulates, yet people who need information still lack convenient special-purpose tools to extract succinct, refined and understandable knowledge that meets their requirements from large-scale, multi-source text resources. The complexity of text data and its many usage scenarios make text classification extremely important. Whether in news aggregation, spam classification, or microblog content analysis, text classification plays an important role.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a Bayes text classifier based on reverse word frequency.
Text classification is an important research field in data mining and machine learning; its goal is to label a new document with a suitable class label. The process of automatic text classification first analyzes the content of the documents in a training set and constructs a classification scheme, i.e., a classifier. Once the classifier has been trained, each class has its own classification scheme, and these schemes can be used to classify new documents.
A naive Bayes classifier is a simple probabilistic classifier that applies Bayes' theorem under an independence assumption. The basis of Bayesian classification is probabilistic inference: completing reasoning and decision tasks when the relevant conditions are uncertain and only their probabilities of occurrence are known. In text classification, we first assume that the occurrence probabilities of the individual words are mutually independent (in real life words are not completely independent, but naive Bayes classification is still very effective), estimate the prior probability of each word from the training data set, and from these compute the posterior probability that a test document containing those words belongs to each class. The document is then assigned to the class with the maximum posterior probability.
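As a purely illustrative aid (not part of the patent text), the decision rule just described can be sketched in Python as follows; the function and variable names are hypothetical, and the log-priors and log-likelihoods are assumed to have been estimated from the training set beforehand:

    def classify(words, log_prior, log_likelihood, classes):
        # log_prior[c]         : log P(c), estimated from the training set
        # log_likelihood[c][w] : log P(w | c), estimated from the training set
        # Working in log space turns the product of word probabilities
        # into a sum and avoids floating-point underflow.
        def score(c):
            return log_prior[c] + sum(log_likelihood[c].get(w, 0.0)
                                      for w in words)
        # Assign the document to the class with the maximum posterior score.
        return max(classes, key=score)

Unknown words fall back to a neutral contribution of 0.0 in this sketch; a real implementation would apply smoothing.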
The technical scheme of the present invention is realized as follows. Its feature is that the reverse word frequency of a word is used as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability; according to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from the prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving the purpose of classification.
Two quantities are mainly considered: the number of times the word occurs in all documents and the number of times the word occurs in the given document. By the law of large numbers, a word's prior probability is represented by the word's frequency in the class; since the discrimination degree of each word is also taken into account, the prior probability obtained in this way no longer purely represents the probability that the word occurs in the class, but rather a comprehensive prior probability of the word's occurrence in that class.
The advantages of the present invention are as follows:
Compared with the prior art, the Bayes text classifier based on reverse word frequency of the present invention first assumes that the occurrence probabilities of the words are mutually independent, estimates the prior probability of each word from the training data set, and from these computes the posterior probability that a test document containing the words belongs to each class. Documents are assigned to the class with the maximum posterior probability. The invention is reasonable in design, simple in structure and convenient to use, and therefore has good practical value.
Embodiment
The Bayes text classifier based on reverse word frequency of the present invention is described in detail below.
The Bayes text classifier based on reverse word frequency of the present invention is characterized in that the reverse word frequency of a word is used as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability; according to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from the prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving the purpose of classification.
Two quantities are mainly considered: the number of times the word occurs in all documents and the number of times the word occurs in the given document. By the law of large numbers, a word's prior probability is represented by the word's frequency in the class; since the discrimination degree of each word is also taken into account, the prior probability obtained in this way no longer purely represents the probability that the word occurs in the class, but rather a comprehensive prior probability of the word's occurrence in that class.
Notation
Here we take spam classification as an example. Suppose class A is spam, class B is non-spam, and Vi denotes each word. We adopt the following notation:
Nums: total number of mails; Counts: total number of words
NumsA: number of spam mails; NumsB: number of non-spam mails
CountsA: total number of words in spam; CountsB: total number of words in non-spam
CountsViA: number of occurrences of word Vi in spam
CountsViB: number of occurrences of word Vi in non-spam
P(A) = NumsA/Nums: the probability that a mail is spam
P(B) = NumsB/Nums: the probability that a mail is non-spam
P(Vi): the probability that word Vi occurs across all documents
P(Vi|A) = CountsViA/CountsA: the probability that word Vi occurs in spam
P(Vi|B) = CountsViB/CountsB: the probability that word Vi occurs in non-spam
P(A|Vi): the probability that a mail is spam given that word Vi occurs
P(B|Vi): the probability that a mail is non-spam given that word Vi occurs
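As an illustrative sketch (an addition, not part of the original specification), the quantities above could be tallied from a labeled corpus as follows; the input format, a list of (word list, is_spam) pairs, is an assumption:

    from collections import Counter

    def tally(mails):
        # mails: list of (words, is_spam) pairs -- an assumed input format
        Nums = len(mails)                                   # total mails
        NumsA = sum(1 for _, is_spam in mails if is_spam)   # spam mails
        NumsB = Nums - NumsA                                # non-spam mails
        countsA, countsB = Counter(), Counter()             # CountsViA, CountsViB
        for words, is_spam in mails:
            (countsA if is_spam else countsB).update(words)
        CountsA = sum(countsA.values())   # total words in spam
        CountsB = sum(countsB.values())   # total words in non-spam
        P_A, P_B = NumsA / Nums, NumsB / Nums
        return countsA, countsB, CountsA, CountsB, P_A, P_B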
Bayes' theorem
According to Bayes' theorem, we obtain:
P(A|Vi) = P(Vi|A) * P(A) / P(Vi)
P(B|Vi) = P(Vi|B) * P(B) / P(Vi)
For a text composed of the words V1..Vn:
P(A|V1..n) = P(V1..n|A) * P(A) / P(V1..n)
Since the words are assumed to be mutually independent, P(V1..n|A) * P(A) = (∏i P(Vi|A)) * P(A). Therefore,
P(A|V1..n) = (∏i P(Vi|A)) * P(A) / P(V1..n)    (1)
Similarly,
P(B|V1..n) = (∏i P(Vi|B)) * P(B) / P(V1..n)    (2)
Therefore, following the idea of naive Bayes, P(Vi|A) and P(Vi|B) are represented by the frequencies with which each word occurs in spam and in non-spam, and formulas (1) and (2) give the probabilities that a mail containing these words is spam or non-spam. If P(A|V1..n) > P(B|V1..n), the mail is classified as spam; otherwise it is classified as non-spam.
Comparing formulas (1) and (2), we find that P(A|V1..n) and P(B|V1..n) have the same denominator, so in actual computation only the numerators need to be computed and compared: if (∏i P(Vi|A)) * P(A) > (∏i P(Vi|B)) * P(B), the mail is spam; otherwise it is non-spam.
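A minimal sketch of this numerator comparison (again only an illustration; the comparison is done in log space to avoid underflow, and zero-count words are simply skipped where a real implementation would smooth them):

    import math

    def is_spam(words, countsA, countsB, CountsA, CountsB, P_A, P_B):
        # Compare the numerators of (1) and (2); the common denominator cancels.
        def score(counts, total, prior):
            return math.log(prior) + sum(math.log(counts[w] / total)
                                         for w in words if counts[w])
        return score(countsA, CountsA, P_A) > score(countsB, CountsB, P_B)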
Weight design based on reverse word frequency
Here, however, we consider the following: the more frequently a word occurs overall, i.e., the more often it occurs in both data sets, the lower the word's discrimination degree. Therefore, when computing P(Vi|A), we introduce a weight λi, and formulas (1) and (2) are updated to:
P(A|V1..n) = (∏i λi * P(Vi|A)) * P(A) / P(V1..n)    (3)
P(B|V1..n) = (∏i λi * P(Vi|B)) * P(B) / P(V1..n)    (4)
The weight λi needs to reflect the discrimination degree of the word over the whole document collection. Discrimination here mainly means that a word should not push the result to an extreme merely because it occurs too frequently or too rarely, so the weight design leans toward penalizing frequency. Next we take the word Vi in spam as an example to discuss the weight design.
Here the core idea is to use reverse word frequency to impose a proportional penalty on word frequency. For example, word Vi occurs CountsViA times in spam overall, so its frequency in spam is CountsViA/CountsA (i.e., its prior probability in spam), and its reverse word frequency in spam is the inverse of this frequency, CountsA/CountsViA; the penalty coefficient of the word within spam is then log(CountsA/CountsViA). The logarithm is taken to keep the penalty from interfering excessively with the prior probability (without the logarithm, the product of the penalty coefficient and the prior probability would always be 1). Likewise, penalizing the word's frequency over the whole corpus, the reverse word frequency of Vi overall is Counts/(CountsViA+CountsViB), so the penalty coefficient of the word over all mails is λi = log(Counts/(CountsViA+CountsViB)).
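Under this reading, the penalty coefficient can be sketched as follows (a hypothetical helper added for illustration, not part of the patent):

    import math

    def reverse_word_frequency_weight(w, countsA, countsB, Counts):
        # lambda_i = log(Counts / (CountsViA + CountsViB)):
        # the log of the inverse of the word's overall relative frequency.
        total_w = countsA[w] + countsB[w]
        return math.log(Counts / total_w) if total_w else 0.0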
Therefore, substituting this weight, formulas (3) and (4) are updated as follows:
P(A|V1..n) = (∏i log(Counts/(CountsViA+CountsViB)) * P(Vi|A)) * P(A) / P(V1..n)    (5)
P(B|V1..n) = (∏i log(Counts/(CountsViA+CountsViB)) * P(Vi|B)) * P(B) / P(V1..n)    (6)
Formulas (5) and (6) are our final posterior probability formulas.
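Putting the pieces together, a sketch of the weighted decision rule of formulas (5) and (6), under the same assumptions as the earlier snippets (log-space numerator comparison; unseen words, and near-ubiquitous words whose weight is not positive, are skipped rather than smoothed):

    import math

    def is_spam_weighted(words, countsA, countsB, CountsA, CountsB,
                         Counts, P_A, P_B):
        # Numerators of (5) and (6): each word's likelihood is multiplied
        # by its reverse-word-frequency weight lambda_i before the product.
        def score(counts, total, prior):
            s = math.log(prior)
            for w in words:
                overall = countsA[w] + countsB[w]
                if counts[w] and overall:
                    lam = math.log(Counts / overall)
                    if lam > 0:
                        s += math.log(lam * counts[w] / total)
            return s
        return score(countsA, CountsA, P_A) > score(countsB, CountsB, P_B)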
Results
We downloaded the data set from the CSDMC2010 SPAM corpus. The raw data set contains 4327 mails in total, of which 1378 are spam and 2949 are non-spam. With seed=123 set, we generate a random sequence over 1..4327. From this random sequence we select 1000, 2000 and 4327 mails respectively as sample sizes and perform training and testing, where each training set takes 70% of the sample and the test set takes the remaining 30%. We run both the original algorithm (the naive Bayes classification model) and the improved algorithm (the Bayes text classifier based on reverse word frequency) and compare their results.
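A sketch of this sampling protocol (the corpus loading and mail representation are assumptions; only the seeded shuffle and the 70/30 split follow the text):

    import random

    def make_split(mails, sample_size, seed=123):
        # Generate a seeded random permutation of the mail indices,
        # take the first sample_size mails, then split 70% / 30%.
        rng = random.Random(seed)
        order = list(range(len(mails)))
        rng.shuffle(order)
        sample = [mails[i] for i in order[:sample_size]]
        cut = int(0.7 * len(sample))
        return sample[:cut], sample[cut:]   # (training set, test set)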
The comparison of the data shows that the improved algorithm outperforms the original algorithm (the naive Bayes classifier) both with fewer samples and with more samples.
The Bayes text classifier based on reverse word frequency of the present invention is very simple and convenient to realize, and can be implemented as described in this specification.
Technical features not described in this specification are known to those skilled in the art.

Claims (1)

1. A Bayes text classifier based on reverse word frequency, characterized in that the reverse word frequency of a word is used as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability; according to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from the prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving the purpose of classification;
wherein two quantities are mainly considered: the number of times the word occurs in all documents and the number of times the word occurs in the given document; and by the law of large numbers, a word's prior probability is represented by the word's frequency in the class; since the discrimination degree of each word is also taken into account, the prior probability obtained in this way no longer purely represents the probability that the word occurs in the class, but rather a comprehensive prior probability of the word's occurrence in that class.
CN201410376416.3A 2014-08-01 2014-08-01 Bayes text classifier based on reverse word frequency Pending CN104142997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410376416.3A CN104142997A (en) 2014-08-01 2014-08-01 Bayes text classifier based on reverse word frequency


Publications (1)

Publication Number Publication Date
CN104142997A 2014-11-12

Family

ID=51852171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410376416.3A Pending CN104142997A (en) 2014-08-01 2014-08-01 Bayes text classifier based on reverse word frequency

Country Status (1)

Country Link
CN (1) CN104142997A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022812A (en) * 2015-07-08 2015-11-04 中国地质大学(武汉) Document length based instance weighting method and text classifying method
CN105022812B (en) * 2015-07-08 2018-10-19 中国地质大学(武汉) A kind of example method of weighting and file classification method based on Document Length
CN107391772A (en) * 2017-09-15 2017-11-24 国网四川省电力公司眉山供电公司 A kind of file classification method based on naive Bayesian
CN107391772B (en) * 2017-09-15 2020-12-01 国网四川省电力公司眉山供电公司 Text classification method based on naive Bayes
CN108108348A (en) * 2017-11-17 2018-06-01 腾讯科技(成都)有限公司 Processing method, server, storage medium and the electronic device of information
CN107889068A (en) * 2017-12-11 2018-04-06 成都欧督系统科技有限公司 Message broadcast controlling method based on radio communication


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141112