CN104142997A - Bayes text classifier based on reverse word frequency - Google Patents

Bayes text classifier based on reverse word frequency

Info

Publication number
CN104142997A
CN104142997A CN201410376416.3A CN201410376416A
Authority
CN
China
Prior art keywords
word
classification
probability
words
bayes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410376416.3A
Other languages
Chinese (zh)
Inventor
关丹辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201410376416.3A priority Critical patent/CN104142997A/en
Publication of CN104142997A publication Critical patent/CN104142997A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 - Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Bayes text classifier based on reverse word frequency. The reverse word frequency of a word serves as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability. According to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from these prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving classification. Compared with the prior art, the Bayes text classifier based on reverse word frequency first assumes that the occurrence probabilities of the words are mutually independent, estimates the prior probability of each word from a training data set, and from these computes the posterior probability that a test document containing the words belongs to each class. The document is assigned to the class with the maximum posterior probability. The classifier is reasonable in design, simple in structure and convenient to use, and therefore has good practical value.

Description

Bayes text classifier based on reverse word frequency
Technical field
The present invention relates to the fields of information science and machine learning, and specifically to a Bayes text classifier based on reverse word frequency.
Background technology
Currently, the arrival of the big data age is gradually being recognized by industry, and big data applications are gradually being put into practice. In the big data age, disciplines such as data analysis, data mining and machine learning are in great demand and have become sharp tools for mining value from data. With the surge of data volume, and especially the marked rise of text data, more and more information accumulates, yet people who need information still lack convenient special-purpose tools to extract succinct, refined and understandable knowledge that meets their requirements from large-scale, multi-source text resources. The complexity of text data and its many usage scenarios make text classification extremely important. Whether in news aggregation, spam classification, or microblog content analysis, text classification plays an important role.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a Bayes text classifier based on reverse word frequency.
Text classification is an important research field in data mining and machine learning; its goal is to label a new document with a suitable class label. The process of automatic text classification first analyzes the content of the documents in a training set and constructs a classification scheme, i.e., a classifier. Once the classifier has been trained, each class has its own classification scheme, and these schemes can be used to classify new documents.
A naive Bayes classifier is a simple probabilistic classifier that applies Bayes' theorem under an independence assumption. The basis of Bayesian classification is probabilistic inference: completing reasoning and decision tasks when the relevant conditions are uncertain and only their probabilities of occurrence are known. In text classification, we first assume that the occurrence probabilities of the individual words are mutually independent (in real life words are not completely independent, but naive Bayes classification is still very effective), estimate the prior probability of each word from the training data set, and from these compute the posterior probability that a test document containing those words belongs to each class. The document is then assigned to the class with the maximum posterior probability.
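As a purely illustrative aid (not part of the patent text), the decision rule just described can be sketched in Python as follows; the function and variable names are hypothetical, and the log-priors and log-likelihoods are assumed to have been estimated from the training set beforehand:

    def classify(words, log_prior, log_likelihood, classes):
        # log_prior[c]         : log P(c), estimated from the training set
        # log_likelihood[c][w] : log P(w | c), estimated from the training set
        # Working in log space turns the product of word probabilities
        # into a sum and avoids floating-point underflow.
        def score(c):
            return log_prior[c] + sum(log_likelihood[c].get(w, 0.0)
                                      for w in words)
        # Assign the document to the class with the maximum posterior score.
        return max(classes, key=score)

Unknown words fall back to a neutral contribution of 0.0 in this sketch; a real implementation would apply smoothing.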
The technical scheme of the present invention is realized as follows. Its feature is that the reverse word frequency of a word is used as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability; according to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from the prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving the purpose of classification.
Two quantities are mainly considered: the number of times the word occurs in all documents and the number of times the word occurs in the given document. By the law of large numbers, a word's prior probability is represented by the word's frequency in the class; since the discrimination degree of each word is also taken into account, the prior probability obtained in this way no longer purely represents the probability that the word occurs in the class, but rather a comprehensive prior probability of the word's occurrence in that class.
The advantages of the present invention are as follows:
Compared with the prior art, the Bayes text classifier based on reverse word frequency of the present invention first assumes that the occurrence probabilities of the words are mutually independent, estimates the prior probability of each word from the training data set, and from these computes the posterior probability that a test document containing the words belongs to each class. Documents are assigned to the class with the maximum posterior probability. The invention is reasonable in design, simple in structure and convenient to use, and therefore has good practical value.
Embodiment
The Bayes text classifier based on reverse word frequency of the present invention is described in detail below.
The Bayes text classifier based on reverse word frequency of the present invention is characterized in that the reverse word frequency of a word is used as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability; according to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from the prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving the purpose of classification.
Two quantities are mainly considered: the number of times the word occurs in all documents and the number of times the word occurs in the given document. By the law of large numbers, a word's prior probability is represented by the word's frequency in the class; since the discrimination degree of each word is also taken into account, the prior probability obtained in this way no longer purely represents the probability that the word occurs in the class, but rather a comprehensive prior probability of the word's occurrence in that class.
Notation
Here we take spam classification as an example. Suppose class A is spam, class B is non-spam, and Vi denotes each word. We adopt the following notation:
Nums: total number of mails; Counts: total number of words
NumsA: number of spam mails; NumsB: number of non-spam mails
CountsA: total number of words in spam; CountsB: total number of words in non-spam
CountsViA: number of occurrences of word Vi in spam
CountsViB: number of occurrences of word Vi in non-spam
P(A) = NumsA/Nums: the probability that a mail is spam
P(B) = NumsB/Nums: the probability that a mail is non-spam
P(Vi): the probability that word Vi occurs across all documents
P(Vi|A) = CountsViA/CountsA: the probability that word Vi occurs in spam
P(Vi|B) = CountsViB/CountsB: the probability that word Vi occurs in non-spam
P(A|Vi): the probability that a mail is spam given that word Vi occurs
P(B|Vi): the probability that a mail is non-spam given that word Vi occurs
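As an illustrative sketch (an addition, not part of the original specification), the quantities above could be tallied from a labeled corpus as follows; the input format, a list of (word list, is_spam) pairs, is an assumption:

    from collections import Counter

    def tally(mails):
        # mails: list of (words, is_spam) pairs -- an assumed input format
        Nums = len(mails)                                   # total mails
        NumsA = sum(1 for _, is_spam in mails if is_spam)   # spam mails
        NumsB = Nums - NumsA                                # non-spam mails
        countsA, countsB = Counter(), Counter()             # CountsViA, CountsViB
        for words, is_spam in mails:
            (countsA if is_spam else countsB).update(words)
        CountsA = sum(countsA.values())   # total words in spam
        CountsB = sum(countsB.values())   # total words in non-spam
        P_A, P_B = NumsA / Nums, NumsB / Nums
        return countsA, countsB, CountsA, CountsB, P_A, P_B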
Bayes' theorem
According to Bayes' theorem, we obtain:
P(A|Vi) = P(Vi|A) * P(A) / P(Vi)
P(B|Vi) = P(Vi|B) * P(B) / P(Vi)
For a text composed of the words V1..Vn:
P(A|V1..n) = P(V1..n|A) * P(A) / P(V1..n)
Since the words are assumed to be mutually independent, P(V1..n|A) * P(A) = (∏i P(Vi|A)) * P(A). Therefore,
P(A|V1..n) = (∏i P(Vi|A)) * P(A) / P(V1..n)    (1)
Similarly,
P(B|V1..n) = (∏i P(Vi|B)) * P(B) / P(V1..n)    (2)
Therefore, following the idea of naive Bayes, P(Vi|A) and P(Vi|B) are represented by the frequencies with which each word occurs in spam and in non-spam, and formulas (1) and (2) give the probabilities that a mail containing these words is spam or non-spam. If P(A|V1..n) > P(B|V1..n), the mail is classified as spam; otherwise it is classified as non-spam.
Comparing formulas (1) and (2), we find that P(A|V1..n) and P(B|V1..n) have the same denominator, so in actual computation only the numerators need to be computed and compared: if (∏i P(Vi|A)) * P(A) > (∏i P(Vi|B)) * P(B), the mail is spam; otherwise it is non-spam.
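A minimal sketch of this numerator comparison (again only an illustration; the comparison is done in log space to avoid underflow, and zero-count words are simply skipped where a real implementation would smooth them):

    import math

    def is_spam(words, countsA, countsB, CountsA, CountsB, P_A, P_B):
        # Compare the numerators of (1) and (2); the common denominator cancels.
        def score(counts, total, prior):
            return math.log(prior) + sum(math.log(counts[w] / total)
                                         for w in words if counts[w])
        return score(countsA, CountsA, P_A) > score(countsB, CountsB, P_B)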
Weight design based on reverse word frequency
Here, however, we consider the following: the more frequently a word occurs overall, i.e., the more often it occurs in both data sets, the lower the word's discrimination degree. Therefore, when computing P(Vi|A), we introduce a weight λi, and formulas (1) and (2) are updated to:
P(A|V1..n) = (∏i λi * P(Vi|A)) * P(A) / P(V1..n)    (3)
P(B|V1..n) = (∏i λi * P(Vi|B)) * P(B) / P(V1..n)    (4)
The weight λi needs to reflect the discrimination degree of the word over the whole document collection. Discrimination here mainly means that a word should not push the result to an extreme merely because it occurs too frequently or too rarely, so the weight design leans toward penalizing frequency. Next we take the word Vi in spam as an example to discuss the weight design.
Here the core idea is to use reverse word frequency to impose a proportional penalty on word frequency. For example, word Vi occurs CountsViA times in spam overall, so its frequency in spam is CountsViA/CountsA (i.e., its prior probability in spam), and its reverse word frequency in spam is the inverse of this frequency, CountsA/CountsViA; the penalty coefficient of the word within spam is then log(CountsA/CountsViA). The logarithm is taken to keep the penalty from interfering excessively with the prior probability (without the logarithm, the product of the penalty coefficient and the prior probability would always be 1). Likewise, penalizing the word's frequency over the whole corpus, the reverse word frequency of Vi overall is Counts/(CountsViA+CountsViB), so the penalty coefficient of the word over all mails is λi = log(Counts/(CountsViA+CountsViB)).
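Under this reading, the penalty coefficient can be sketched as follows (a hypothetical helper added for illustration, not part of the patent):

    import math

    def reverse_word_frequency_weight(w, countsA, countsB, Counts):
        # lambda_i = log(Counts / (CountsViA + CountsViB)):
        # the log of the inverse of the word's overall relative frequency.
        total_w = countsA[w] + countsB[w]
        return math.log(Counts / total_w) if total_w else 0.0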
Therefore, substituting this weight, formulas (3) and (4) are updated as follows:
P(A|V1..n) = (∏i log(Counts/(CountsViA+CountsViB)) * P(Vi|A)) * P(A) / P(V1..n)    (5)
P(B|V1..n) = (∏i log(Counts/(CountsViA+CountsViB)) * P(Vi|B)) * P(B) / P(V1..n)    (6)
Formulas (5) and (6) are our final posterior probability formulas.
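Putting the pieces together, a sketch of the weighted decision rule of formulas (5) and (6), under the same assumptions as the earlier snippets (log-space numerator comparison; unseen words, and near-ubiquitous words whose weight is not positive, are skipped rather than smoothed):

    import math

    def is_spam_weighted(words, countsA, countsB, CountsA, CountsB,
                         Counts, P_A, P_B):
        # Numerators of (5) and (6): each word's likelihood is multiplied
        # by its reverse-word-frequency weight lambda_i before the product.
        def score(counts, total, prior):
            s = math.log(prior)
            for w in words:
                overall = countsA[w] + countsB[w]
                if counts[w] and overall:
                    lam = math.log(Counts / overall)
                    if lam > 0:
                        s += math.log(lam * counts[w] / total)
            return s
        return score(countsA, CountsA, P_A) > score(countsB, CountsB, P_B)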
Results
We downloaded the data set from the CSDMC2010 SPAM corpus. The raw data set contains 4327 mails in total, of which 1378 are spam and 2949 are non-spam. With seed=123 set, we generate a random sequence over 1..4327. From this random sequence we select 1000, 2000 and 4327 mails respectively as sample sizes and perform training and testing, where each training set takes 70% of the sample and the test set takes the remaining 30%. We run both the original algorithm (the naive Bayes classification model) and the improved algorithm (the Bayes text classifier based on reverse word frequency) and compare their results.
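A sketch of this sampling protocol (the corpus loading and mail representation are assumptions; only the seeded shuffle and the 70/30 split follow the text):

    import random

    def make_split(mails, sample_size, seed=123):
        # Generate a seeded random permutation of the mail indices,
        # take the first sample_size mails, then split 70% / 30%.
        rng = random.Random(seed)
        order = list(range(len(mails)))
        rng.shuffle(order)
        sample = [mails[i] for i in order[:sample_size]]
        cut = int(0.7 * len(sample))
        return sample[:cut], sample[cut:]   # (training set, test set)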
The comparison of the data shows that the improved algorithm outperforms the original algorithm (the naive Bayes classifier) both with fewer samples and with more samples.
The Bayes text classifier based on reverse word frequency of the present invention is very simple and convenient to realize, and can be implemented as described in this specification.
Technical features not described in this specification are known to those skilled in the art.

Claims (1)

1. A Bayes text classifier based on reverse word frequency, characterized in that the reverse word frequency of a word is used as the word's discrimination degree, and the word's frequency in each class is weighted by it to obtain the word's comprehensive prior probability; according to Bayes' theorem, the posterior probabilities that the words belong to the different classes are computed from the prior probabilities, and the class with the maximum posterior probability is selected, thereby achieving the purpose of classification;
wherein two quantities are mainly considered: the number of times the word occurs in all documents and the number of times the word occurs in the given document; and by the law of large numbers, a word's prior probability is represented by the word's frequency in the class; since the discrimination degree of each word is also taken into account, the prior probability obtained in this way no longer purely represents the probability that the word occurs in the class, but rather a comprehensive prior probability of the word's occurrence in that class.
CN201410376416.3A 2014-08-01 2014-08-01 Bayes text classifier based on reverse word frequency Pending CN104142997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410376416.3A CN104142997A (en) 2014-08-01 2014-08-01 Bayes text classifier based on reverse word frequency


Publications (1)

Publication Number Publication Date
CN104142997A 2014-11-12

Family

ID=51852171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410376416.3A Pending CN104142997A (en) 2014-08-01 2014-08-01 Bayes text classifier based on reverse word frequency

Country Status (1)

Country Link
CN (1) CN104142997A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022812A (en) * 2015-07-08 2015-11-04 中国地质大学(武汉) Document length based instance weighting method and text classifying method
CN105022812B (en) * 2015-07-08 2018-10-19 中国地质大学(武汉) A kind of example method of weighting and file classification method based on Document Length
CN107391772A (en) * 2017-09-15 2017-11-24 国网四川省电力公司眉山供电公司 A kind of file classification method based on naive Bayesian
CN107391772B (en) * 2017-09-15 2020-12-01 国网四川省电力公司眉山供电公司 Text classification method based on naive Bayes
CN108108348A (en) * 2017-11-17 2018-06-01 腾讯科技(成都)有限公司 Processing method, server, storage medium and the electronic device of information
CN107889068A (en) * 2017-12-11 2018-04-06 成都欧督系统科技有限公司 Message broadcast controlling method based on radio communication


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141112