CN109165294A - Short text classification method based on Bayesian classification - Google Patents

Short text classification method based on Bayesian classification Download PDF

Info

Publication number
CN109165294A
CN109165294A (application CN201810951636.2A; granted as CN109165294B)
Authority
CN
China
Prior art keywords
classification
data
short text
text
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810951636.2A
Other languages
Chinese (zh)
Other versions
CN109165294B (en
Inventor
水新莹
张宇光
黄亚坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xunfei Intelligent Technology Co ltd
Original Assignee
Anhui Xunfei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xunfei Intelligent Technology Co ltd filed Critical Anhui Xunfei Intelligent Technology Co ltd
Priority to CN201810951636.2A priority Critical patent/CN109165294B/en
Publication of CN109165294A publication Critical patent/CN109165294A/en
Application granted granted Critical
Publication of CN109165294B publication Critical patent/CN109165294B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short text classification method based on Bayesian classification, relating to the fields of smart cities and e-government, comprising the following steps: (1) preprocessing the data and labelling categories; (2) completing word segmentation and incremental feature vector extraction for the short text data, which comprises two core steps; (3) establishing a Bayesian short text classification model; (4) dividing the processed data set into a training set and a test set, training the classification model, and optimising the model according to the results on the training set; (5) inputting short text data of unknown class into the trained model, outputting the probability that the input text belongs to each class, and selecting the class with the highest probability as the final classification result. The short text classification method based on Bayesian classification can classify short text content effectively, intelligently and automatically.

Description

Short text classification method based on Bayesian classification
Technical field
The present invention relates to the fields of smart cities and e-government, and in particular to a short text classification method based on Bayesian classification.
Background art:
With the development of the mobile Internet and social networks and the rise of social software such as Weibo and WeChat, companies and government departments have also gradually begun to use social software to establish connections and communicate. Frequent publication of short pieces of text is characteristic of mobile social media, and the volume of short text content is growing rapidly. Short texts are a research focus in search engines, intelligent customer service and public-opinion monitoring. Faced with a huge and constantly growing number of netizens, extracting useful information from incomplete texts such as incident descriptions, private messages and comments is particularly important to decision makers such as media outlets and governments. Manual processing is inefficient and often cannot complete the task. How to classify huge volumes of short text efficiently, intelligently and automatically is therefore of great significance to the construction of e-government.
Existing text classification techniques mainly design their core classification algorithms around how representative the keywords are, i.e. around popularity-based weighting and similar methods. For example, the existing literature on "a text classification method based on clustered word embeddings" mainly applies the k-means algorithm to the word vectors of documents to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a "super word embedding", and each embedded word in the text collection is assigned to its nearest cluster centre. Each text is then represented as a bag of super word embeddings, and the frequency of each super word embedding in the text is computed to obtain the type of the text.
Analysis of the above short text classification methods shows that the choice of keywords affects the classification quality: both the number of keywords and the popularity of the features must be considered. In short text classification, however, a short text contains few characteristic keywords; during actual classification, the keywords struggle to express the inherent meaning of the short text effectively, and a single text easily produces multiple candidate categories. In addition, the semantic information in short texts also affects the classification result. The characteristic-keyword extraction methods of the prior art work well for classifying long texts, but classify short texts poorly.
For example, application No. CN201710216502.1 discloses a method for obtaining a text classifier by automatically labelling a corpus, and the text classifier itself. The method includes determining a concept set and matching unlabelled text against the concept keyword set of each concept to label it automatically; for each concept, when the number of texts labelled with that concept in the labelled corpus meets a threshold condition, a corresponding text classification model is trained for the concept to obtain a text classifier, and finally the set of text classifiers for all concepts whose text counts meet the threshold is obtained. This kind of algorithm structure is universal, can flexibly change the classification system, and saves computation time and resources; it also needs only a small amount of initial corpus text and labels automatically, without manual annotation, further saving time and cost. However, this kind of classification method does not disclose a technical solution for improving its accuracy through autonomous training.
As another example, application No. CN201710882685.0 discloses a method and device for establishing a text classification model and for text classification. The establishment method includes: obtaining training samples; obtaining the corresponding vector matrix after segmenting the text based on an entity dictionary; and using the vector matrix and the class of the text to train a first classification model and a second classification model. During training, the loss function of the text classification model is obtained from the loss functions of the first and second classification models and used to adjust the parameters of the two models, thereby obtaining a text classification model composed of the first and second classification models. The text classification method includes: obtaining the text to be classified; obtaining its vector matrix after entity-dictionary-based segmentation; and inputting the vector matrix into the text classification model and obtaining the classification result from the model output. However, this kind of classification method likewise does not disclose a technical solution for improving its accuracy through autonomous training.
Summary of the invention
The purpose of the present invention is to provide a short text classification method based on Bayesian classification, so as to overcome the above-mentioned defects of the prior art.
A short text classification method based on Bayesian classification, characterised in that the method comprises the following steps:
(1) Data preprocessing and category labelling:
Step 1: extract the reported historical short text data, and perform routine data cleaning, data integration and similar processing on the data to improve its quality;
Step 2: for the preliminarily cleaned data, the short texts already processed in the past have been manually labelled with categories; manually label the currently unprocessed part of the data to complete data preprocessing;
(2) Complete word segmentation and incremental feature vector extraction for the short text data, which divides into the following two core steps:
Step 1: segment the cleaned short text content with the third-party Python library Jieba;
Step 2: build the incremental feature vector and combine it with TF-IDF for keyword extraction; if too few keywords are extracted, use all segmented phrases directly as the final classification input;
(3) Establish the Bayesian short text classification model;
(4) Divide the processed data set into a training set and a test set, train the classification model, and optimise the model according to the results on the training set;
(5) With the trained model, input short text data of unknown class, output the probability that the input text belongs to each class, and select the class with the highest probability as the final classification result.
Preferably, the data preprocessing comprises the following four steps:
Step 1: clean and split the raw data, using Kettle to divide each record into three fields: the major-class serial number, the sub-class serial number and the text;
Step 2: store the processed data in a database;
Step 3: segment the content of the third field, i.e. the plain text, with Jieba;
Step 4: from each row of segmented words, retain three words according to their part of speech and store them in the database.
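The four preprocessing steps above can be sketched as follows. The tab separator, the field order, the part-of-speech keep-list and the helper names are illustrative assumptions, not part of the patent; in the actual method the splitting is done in Kettle and the segmentation and tagging by Jieba.

```python
def split_record(row, sep="\t"):
    """Step 1: split one raw row into the three fields -- major-class
    serial number, sub-class serial number, plain text.
    (The separator and field order are assumptions for illustration.)"""
    major, group, text = row.split(sep, 2)
    return {"major": major, "group": group, "text": text}

# An assumed keep-list: nouns, verbs, adjectives (Jieba-style POS tags).
KEEP_POS = {"n", "v", "a"}

def keep_words(tagged, limit=3):
    """Step 4: from one row of (word, pos) pairs -- such as Jieba's
    part-of-speech mode produces -- retain up to `limit` words
    whose part of speech is in the keep-list."""
    return [w for w, pos in tagged if pos in KEEP_POS][:limit]
```

In practice `tagged` would come from Jieba's part-of-speech segmentation of the text field; a toy tagged list keeps the sketch self-contained.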
Preferably, extracting characteristic keywords with the incremental feature vector and the TF-IDF feature-word extraction method comprises the following two steps:
Step 1: let B=(B1,B2,...,Bu) be the feature vector composed of the feature words extracted from the text, where the value of u is small, e.g. 3 or 4. Taking garbage reports as an example, the position of the garbage may be floating on a water surface, in a green belt or on a road surface; the words describing the distribution position of the garbage are summarised into a new feature word Bu+1 and given a name. Proceeding in the same way for u=5,6,...,m yields the incremental feature vector B=(B1,B2,...,Bm);
Step 2: if a word or phrase has a high term frequency TF in one article but rarely appears in other articles, the word or phrase is considered to have good class discrimination ability and to be suitable for classification. The TF-IDF feature extraction function is f(w)=TF(w)×IDF(w), and according to this formula the characteristic keywords of the short text content are extracted. First, the TF value of a feature word w is denoted TF(w); the term frequency TF is usually used together with the inverse document frequency IDF. Then IDF(w)=log[N/n(w)+1] is computed, where N is the total number of texts and n(w) is the number of texts containing w.
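A minimal sketch of the keyword scoring in step 2 follows. It reads the extraction function as f(w)=TF(w)×IDF(w) with IDF(w)=log[N/n(w)+1], which is one reading of the garbled formula in the text; the function name and toy tokens are illustrative.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, top_k=4):
    """Score each word w of one document with f(w) = TF(w) * IDF(w),
    IDF(w) = log(N / n(w) + 1), where N is the total number of texts
    and n(w) the number of texts containing w.  The document is assumed
    to be one of the `corpus` texts, so n(w) >= 1 for its words."""
    n_docs = len(corpus)
    df = Counter()                      # n(w): document frequency
    for doc in corpus:
        df.update(set(doc))
    tf = Counter(doc_tokens)
    scores = {w: (tf[w] / len(doc_tokens)) * math.log(n_docs / df[w] + 1)
              for w in tf}
    # The highest-scoring words are taken as the characteristic keywords.
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Words that occur in every text (low IDF) are ranked below words concentrated in one text, matching the class-discrimination criterion described above.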
Preferably, for an input short text sample record, B=(B1,B2,...,Bm) is the extracted feature vector and C1,C2,...,Cn are the n classification results; P(Ci|B), i=1,2,...,n, denotes the probability that the text to be classified belongs to the i-th classification result, and P(Bj|Ci), j=1,2,...,m, i=1,2,...,n, denotes the probability that the j-th feature word belongs to the i-th class. The calculation is based on the Bayesian formula, as shown below:
P(Ci|B) = P(B|Ci)·P(Ci)/P(B)
When classifying a new text, it is only necessary to calculate the value of P(Ci|B) for the n classes and assign the new sample to the class with the largest probability value, where the probability P(B) is a constant independent of the class. Using the independence between the feature words of the feature vector B=(B1,B2,...,Bm), the above formula can be simplified to:
P(Ci|B) ∝ P(Ci)·P(B1|Ci)·P(B2|Ci)·...·P(Bm|Ci)
Preferably, according to the established model, the class of unknown short text information is calculated. Let N be the total number of samples and Cou(Ci) the count of the i-th class among the samples; then P(Ci)=Cou(Ci)/N. Let Cou(Bij) be the count of the j-th feature word in the i-th class; then P(Bj|Ci)=Cou(Bij)/Cou(Ci). Finally, the probability that the sample to be classified belongs to each class is calculated, and the class with the maximum probability is taken.
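The count-based estimates P(Ci)=Cou(Ci)/N and P(Bj|Ci)=Cou(Bij)/Cou(Ci) and the final maximum-probability selection can be sketched as below. The function names and toy labels are illustrative; no smoothing is applied because the description specifies raw counts, so a feature word unseen in a class zeroes that class's score.

```python
from collections import Counter, defaultdict

def estimate(samples):
    """Estimate P(Ci) = Cou(Ci)/N and P(Bj|Ci) = Cou(Bij)/Cou(Ci)
    from labelled samples, per the formulas above.
    samples: list of (feature_words, class_label)."""
    n = len(samples)                                   # N
    cou_c = Counter(label for _, label in samples)     # Cou(Ci)
    cou_b = defaultdict(Counter)                       # Cou(Bij)
    for words, label in samples:
        for w in words:
            cou_b[label][w] += 1
    p_c = {c: cou_c[c] / n for c in cou_c}             # P(Ci)
    p_b_c = {c: {w: cou_b[c][w] / cou_c[c] for w in cou_b[c]} for c in cou_c}
    return p_c, p_b_c

def classify(words, p_c, p_b_c):
    """Return the class maximising P(Ci) * prod_j P(Bj|Ci)."""
    def score(c):
        p = p_c[c]
        for w in words:
            p *= p_b_c[c].get(w, 0.0)   # unseen word: zero (no smoothing)
        return p
    return max(p_c, key=score)
```

A short toy run: two "env" samples and one "noise" sample give P(env)=2/3, and a text containing "garbage" is assigned to "env".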
The present invention has the following advantages: with this short text classification method based on Bayesian classification, user-reported short text content is analysed, classified and dispatched to the responsible unit. For the core short text classification process, the source data are first cleaned, integrated and otherwise regularised, part of the short text data is extracted as training data, and the extracted data are labelled with categories according to the classification requirements. Then the cleaned short text content is segmented with the third-party Python library Jieba and keywords are extracted with TF-IDF. Considering that short texts contain little content, the keywords extracted by TF-IDF serve as a reference before the Bayesian classification model is built; if too few keywords are extracted, the phrases from the segmented short text are used directly for classification modelling. Following the above steps, the classification model is established on the basis of the Bayesian formula and adjusted until the precision of the classification test stabilises.
Description of the drawings
Fig. 1 is flow chart of the method for the present invention.
Fig. 2 is the flow chart of data processing in the present invention.
Specific embodiment
To make the technical means, creative features, objectives and effects achieved by the present invention easy to understand, the present invention is further explained below with reference to specific embodiments.
As shown in Fig. 1 and Fig. 2, a short text classification method based on Bayesian classification is characterised in that the method comprises the following steps:
(1) Data preprocessing and category labelling:
Step 1: extract the reported historical short text data, and perform routine data cleaning, data integration and similar processing on the data to improve its quality;
Step 2: for the preliminarily cleaned data, the short texts already processed in the past have been manually labelled with categories; manually label the currently unprocessed part of the data to complete data preprocessing;
(2) Complete word segmentation and incremental feature vector extraction for the short text data, which divides into the following two core steps:
Step 1: segment the cleaned short text content with the third-party Python library Jieba;
Step 2: build the incremental feature vector and combine it with TF-IDF for keyword extraction; if too few keywords are extracted, use all segmented phrases directly as the final classification input;
(3) Establish the Bayesian short text classification model;
(4) Divide the processed data set into a training set and a test set, train the classification model, and optimise the model according to the results on the training set;
(5) With the trained model, input short text data of unknown class, output the probability that the input text belongs to each class, and select the class with the highest probability as the final classification result.
It is worth noting that the data preprocessing comprises the following four steps:
Step 1: clean and split the raw data, using Kettle to divide each record into three fields: the major-class serial number, the sub-class serial number and the text;
Step 2: store the processed data in a database;
Step 3: segment the content of the third field, i.e. the plain text, with Jieba;
Step 4: from each row of segmented words, retain three words according to their part of speech and store them in the database.
In this embodiment, extracting characteristic keywords with the incremental feature vector and the TF-IDF feature-word extraction method comprises the following two steps:
Step 1: let B=(B1,B2,...,Bu) be the feature vector composed of the feature words extracted from the text, where the value of u is small, e.g. 3 or 4. Taking garbage reports as an example, the position of the garbage may be floating on a water surface, in a green belt or on a road surface; the words describing the distribution position of the garbage are summarised into a new feature word Bu+1 and given a name. Proceeding in the same way for u=5,6,...,m yields the incremental feature vector B=(B1,B2,...,Bm);
Step 2: if a word or phrase has a high term frequency TF in one article but rarely appears in other articles, the word or phrase is considered to have good class discrimination ability and to be suitable for classification. The TF-IDF feature extraction function is f(w)=TF(w)×IDF(w), and according to this formula the characteristic keywords of the short text content are extracted. First, the TF value of a feature word w is denoted TF(w); the term frequency TF is usually used together with the inverse document frequency IDF. Then IDF(w)=log[N/n(w)+1] is computed, where N is the total number of texts and n(w) is the number of texts containing w.
In this embodiment, for an input short text sample record, B=(B1,B2,...,Bm) is the extracted feature vector and C1,C2,...,Cn are the n classification results; P(Ci|B), i=1,2,...,n, denotes the probability that the text to be classified belongs to the i-th classification result, and P(Bj|Ci), j=1,2,...,m, i=1,2,...,n, denotes the probability that the j-th feature word belongs to the i-th class. The calculation is based on the Bayesian formula, as shown below:
P(Ci|B) = P(B|Ci)·P(Ci)/P(B)
When classifying a new text, it is only necessary to calculate the value of P(Ci|B) for the n classes and assign the new sample to the class with the largest probability value, where the probability P(B) is a constant independent of the class. Using the independence between the feature words of the feature vector B=(B1,B2,...,Bm), the above formula can be simplified to:
P(Ci|B) ∝ P(Ci)·P(B1|Ci)·P(B2|Ci)·...·P(Bm|Ci)
In addition, according to the established model, the class of unknown short text information is calculated. Let N be the total number of samples and Cou(Ci) the count of the i-th class among the samples; then P(Ci)=Cou(Ci)/N. Let Cou(Bij) be the count of the j-th feature word in the i-th class; then P(Bj|Ci)=Cou(Bij)/Cou(Ci). Finally, the probability that the sample to be classified belongs to each class is calculated, and the class with the maximum probability is taken.
Based on the above, this short text classification method based on Bayesian classification comprises the following steps: (1) data preprocessing and category labelling; (2) completing word segmentation and incremental feature vector extraction for the short text data, which divides into two core steps; (3) establishing the Bayesian short text classification model; (4) dividing the processed data set into a training set and a test set, training the classification model, and optimising the model according to the results on the training set; (5) with the trained model, inputting short text data of unknown class, outputting the probability that the input text belongs to each class, and selecting the class with the highest probability as the final classification result. User-reported short text content is analysed, classified and dispatched to the responsible unit. For the core short text classification process, the source data are first cleaned, integrated and otherwise regularised, part of the short text data is extracted as training data, and the extracted data are labelled with categories according to the classification requirements. Then the cleaned short text content is segmented with the third-party Python library Jieba and keywords are extracted with TF-IDF; considering that short texts contain little content, the keywords extracted by TF-IDF serve as a reference before the Bayesian classification model is built, and if too few keywords are extracted, the phrases from the segmented short text are used directly for classification modelling. Following the above steps, the classification model is established on the basis of the Bayesian formula and adjusted until the precision of the classification test stabilises.
As is known to those skilled in the art, the present invention can be realised through other embodiments without departing from its spirit or essential characteristics. The embodiments disclosed above are therefore illustrative in all respects and not exclusive. All changes within the scope of the present invention, or equivalent to the scope of the present invention, are included in the invention.

Claims (5)

1. A short text classification method based on Bayesian classification, characterised in that the method comprises the following steps:
(1) Data preprocessing and category labelling:
Step 1: extract the reported historical short text data, and perform routine data cleaning, data integration and similar processing on the data to improve its quality;
Step 2: for the preliminarily cleaned data, the short texts already processed in the past have been manually labelled with categories; manually label the currently unprocessed part of the data to complete data preprocessing;
(2) Complete word segmentation and incremental feature vector extraction for the short text data, which divides into the following two core steps:
Step 1: segment the cleaned short text content with the third-party Python library Jieba;
Step 2: build the incremental feature vector and combine it with TF-IDF for keyword extraction; if too few keywords are extracted, use all segmented phrases directly as the final classification input;
(3) Establish the Bayesian short text classification model;
(4) Divide the processed data set into a training set and a test set, train the classification model, and optimise the model according to the results on the training set;
(5) With the trained model, input short text data of unknown class, output the probability that the input text belongs to each class, and select the class with the highest probability as the final classification result.
2. The short text classification method based on Bayesian classification according to claim 1, characterised in that the data preprocessing comprises the following four steps:
Step 1: clean and split the raw data, using Kettle to divide each record into three fields: the major-class serial number, the sub-class serial number and the text;
Step 2: store the processed data in a database;
Step 3: segment the content of the third field, i.e. the plain text, with Jieba;
Step 4: from each row of segmented words, retain three words according to their part of speech and store them in the database.
3. The short text classification method based on Bayesian classification according to claim 1, characterised in that extracting characteristic keywords with the incremental feature vector and the TF-IDF feature-word extraction method comprises the following two steps:
Step 1: let B=(B1,B2,...,Bu) be the feature vector composed of the feature words extracted from the text, where the value of u is small, e.g. 3 or 4. Taking garbage reports as an example, the position of the garbage may be floating on a water surface, in a green belt or on a road surface; the words describing the distribution position of the garbage are summarised into a new feature word Bu+1 and given a name. Proceeding in the same way for u=5,6,...,m yields the incremental feature vector B=(B1,B2,...,Bm);
Step 2: if a word or phrase has a high term frequency TF in one article but rarely appears in other articles, the word or phrase is considered to have good class discrimination ability and to be suitable for classification. The TF-IDF feature extraction function is f(w)=TF(w)×IDF(w), and according to this formula the characteristic keywords of the short text content are extracted. First, the TF value of a feature word w is denoted TF(w); the term frequency TF is usually used together with the inverse document frequency IDF. Then IDF(w)=log[N/n(w)+1] is computed, where N is the total number of texts and n(w) is the number of texts containing w.
4. The short text classification method based on Bayesian classification according to claim 3, characterised in that, for an input short text sample record, B=(B1,B2,...,Bm) is the extracted feature vector and C1,C2,...,Cn are the n classification results; P(Ci|B), i=1,2,...,n, denotes the probability that the text to be classified belongs to the i-th classification result, and P(Bj|Ci), j=1,2,...,m, i=1,2,...,n, denotes the probability that the j-th feature word belongs to the i-th class. The calculation is based on the Bayesian formula, as shown below:
P(Ci|B) = P(B|Ci)·P(Ci)/P(B)
When classifying a new text, it is only necessary to calculate the value of P(Ci|B) for the n classes and assign the new sample to the class with the largest probability value, where the probability P(B) is a constant independent of the class. Using the independence between the feature words of the feature vector B=(B1,B2,...,Bm), the above formula can be simplified to:
P(Ci|B) ∝ P(Ci)·P(B1|Ci)·P(B2|Ci)·...·P(Bm|Ci)
5. The short text classification method based on Bayesian classification according to claim 1, characterised in that, according to the established model, the class of unknown short text information is calculated. Let N be the total number of samples and Cou(Ci) the count of the i-th class among the samples; then P(Ci)=Cou(Ci)/N. Let Cou(Bij) be the count of the j-th feature word in the i-th class; then P(Bj|Ci)=Cou(Bij)/Cou(Ci). Finally, the probability that the sample to be classified belongs to each class is calculated, and the class with the maximum probability is selected.
CN201810951636.2A 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification Active CN109165294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810951636.2A CN109165294B (en) 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810951636.2A CN109165294B (en) 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification

Publications (2)

Publication Number Publication Date
CN109165294A true CN109165294A (en) 2019-01-08
CN109165294B CN109165294B (en) 2021-09-24

Family

ID=64896189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810951636.2A Active CN109165294B (en) 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification

Country Status (1)

Country Link
CN (1) CN109165294B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619363A (en) * 2019-09-17 2019-12-27 Shaanxi Youbai Information Technology Co., Ltd. Classification method for subclass names corresponding to long description of material data
CN111159414A (en) * 2020-04-02 2020-05-15 Chengdu Business Big Data Technology Co., Ltd. Text classification method and system, electronic equipment and computer readable storage medium
CN111488459A (en) * 2020-04-15 2020-08-04 Focus Technology Co., Ltd. Product classification method based on keywords
CN111985222A (en) * 2020-08-24 2020-11-24 Ping An International Smart City Technology Co., Ltd. Text keyword recognition method and related equipment
WO2020244336A1 (en) * 2019-06-04 2020-12-10 Shenzhen Qianhai WeBank Co., Ltd. Alarm classification method and device, electronic device, and storage medium
CN112084308A (en) * 2020-09-16 2020-12-15 China Academy of Information and Communications Technology Method, system and storage medium for text type data recognition
CN112214598A (en) * 2020-09-27 2021-01-12 Zhongrun Puda (Shiyan) Big Data Center Co., Ltd. Cognitive system based on hair condition
CN112256865A (en) * 2019-01-31 2021-01-22 Qingdao University of Science and Technology Chinese text classification method based on classifier
CN112559748A (en) * 2020-12-18 2021-03-26 Xiamen Fadu Information Technology Co., Ltd. Method for classifying stroke record data records, terminal equipment and storage medium
CN112883159A (en) * 2021-02-25 2021-06-01 Beijing Jingzhun Goutong Media Technology Co., Ltd. Method, medium, and electronic device for generating hierarchical category label for domain evaluation short text
CN113869356A (en) * 2021-08-17 2021-12-31 Hangzhou Huating Technology Co., Ltd. Method for judging escape tendency of people based on Bayesian classification
CN114528404A (en) * 2022-02-18 2022-05-24 Inspur Zhuoshu Big Data Industry Development Co., Ltd. Method and device for identifying provincial and urban areas
CN114564582A (en) * 2022-02-25 2022-05-31 Suzhou Inspur Intelligent Technology Co., Ltd. Short text classification method, device, equipment and storage medium
CN116956930A (en) * 2023-09-20 2023-10-27 Beijing Jiuqi Technology Co., Ltd. Short text information extraction method and system integrating rules and learning models

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725732B1 (en) * 2009-03-13 2014-05-13 Google Inc. Classifying text into hierarchical categories
CN104850650A (en) * 2015-05-29 2015-08-19 Tsinghua University Short-text expansion method based on similar-label relations
WO2016090197A1 (en) * 2014-12-05 2016-06-09 Lightning Source Inc. Automated content classification/filtering
CN106407482A (en) * 2016-12-01 2017-02-15 Hefei University of Technology Multi-feature fusion-based online academic report classification method
CN107066553A (en) * 2017-03-24 2017-08-18 Beijing University of Technology Short text classification method based on convolutional neural networks and random forest


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fan Yunjie, Liu Huailiang: "Research on Chinese Short Text Classification Based on Wikipedia", New Technology of Library and Information Service *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256865A (en) * 2019-01-31 2021-01-22 Qingdao University of Science and Technology Chinese text classification method based on classifier
CN112256865B (en) * 2019-01-31 2023-03-21 Qingdao University of Science and Technology Chinese text classification method based on classifier
WO2020244336A1 (en) * 2019-06-04 2020-12-10 Shenzhen Qianhai WeBank Co., Ltd. Alarm classification method and device, electronic device, and storage medium
CN110619363A (en) * 2019-09-17 2019-12-27 Shaanxi Youbai Information Technology Co., Ltd. Classification method for subclass names corresponding to long description of material data
CN111159414A (en) * 2020-04-02 2020-05-15 Chengdu Business Big Data Technology Co., Ltd. Text classification method and system, electronic equipment and computer readable storage medium
CN111488459B (en) * 2020-04-15 2022-07-22 Focus Technology Co., Ltd. Product classification method based on keywords
CN111488459A (en) * 2020-04-15 2020-08-04 Focus Technology Co., Ltd. Product classification method based on keywords
CN111985222A (en) * 2020-08-24 2020-11-24 Ping An International Smart City Technology Co., Ltd. Text keyword recognition method and related equipment
CN111985222B (en) * 2020-08-24 2023-07-18 Ping An International Smart City Technology Co., Ltd. Text keyword recognition method and related equipment
CN112084308A (en) * 2020-09-16 2020-12-15 China Academy of Information and Communications Technology Method, system and storage medium for text type data recognition
CN112214598A (en) * 2020-09-27 2021-01-12 Zhongrun Puda (Shiyan) Big Data Center Co., Ltd. Cognitive system based on hair condition
CN112214598B (en) * 2020-09-27 2023-01-13 Wuzheng Intelligent Technology (Beijing) Co., Ltd. Cognitive system based on hair condition
CN112559748A (en) * 2020-12-18 2021-03-26 Xiamen Fadu Information Technology Co., Ltd. Method for classifying stroke record data records, terminal equipment and storage medium
CN112883159A (en) * 2021-02-25 2021-06-01 Beijing Jingzhun Goutong Media Technology Co., Ltd. Method, medium, and electronic device for generating hierarchical category label for domain evaluation short text
CN113869356A (en) * 2021-08-17 2021-12-31 Hangzhou Huating Technology Co., Ltd. Method for judging escape tendency of people based on Bayesian classification
CN114528404A (en) * 2022-02-18 2022-05-24 Inspur Zhuoshu Big Data Industry Development Co., Ltd. Method and device for identifying provincial and urban areas
CN114564582A (en) * 2022-02-25 2022-05-31 Suzhou Inspur Intelligent Technology Co., Ltd. Short text classification method, device, equipment and storage medium
CN114564582B (en) * 2022-02-25 2024-06-28 Suzhou Inspur Intelligent Technology Co., Ltd. Short text classification method, device, equipment and storage medium
CN116956930A (en) * 2023-09-20 2023-10-27 Beijing Jiuqi Technology Co., Ltd. Short text information extraction method and system integrating rules and learning models

Also Published As

Publication number Publication date
CN109165294B (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN109165294A (en) Short text classification method based on Bayesian classification
CN108710651B (en) Automatic classification method for large-scale customer complaint data
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN107577688B (en) Original article influence analysis system based on media information acquisition
CN102289522B (en) Method of intelligently classifying texts
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN110781679B (en) News event keyword mining method based on associated semantic chain network
CN110415071B (en) Automobile competitive product comparison method based on viewpoint mining analysis
CN1687924A (en) Method for building an Internet person information search engine
CN109657063A (en) Processing method and storage medium for massive manually reported environmental protection event data
CN111309864A (en) User group emotional tendency migration dynamic analysis method for microblog hot topics
CN109783623A (en) Data analysis method for user-customer service dialogues in real-world scenarios
CN112149422B (en) Dynamic enterprise news monitoring method based on natural language
CN111782806A (en) Artificial intelligence algorithm-based similar marketing enterprise retrieval classification method and system
CN111522950B (en) Rapid identification system for unstructured massive text sensitive data
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN115017887A (en) Chinese rumor detection method based on graph convolution
CN111651566A (en) Method for extracting dispute focuses from judgment documents based on multi-task few-shot learning
CN111460147A (en) Title short text classification method based on semantic enhancement
CN108399238A (en) Viewpoint search system and method fusing text summaries and network representation
CN105677888A (en) Service preference identification method based on user time fragments
CN109871889B (en) Public psychological assessment method under emergency
CN115600602B (en) Method, system and terminal device for extracting key elements of long text
CN107122420A (en) Tourist hotspot event detection method and system
CN108804524B (en) Emotion distinguishing and importance dividing method based on hierarchical classification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 241000 room 01, 18 / F, iFLYTEK intelligent building, No. 9, Wenjin West Road, Yijiang District, Wuhu City, Anhui Province

Patentee after: ANHUI XUNFEI INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: 241000 Floor 9, block A1, Wanjiang Fortune Plaza, Jiujiang District, Wuhu City, Anhui Province

Patentee before: ANHUI XUNFEI INTELLIGENT TECHNOLOGY Co.,Ltd.