CN109165294B - Short text classification method based on Bayesian classification - Google Patents

Short text classification method based on Bayesian classification

Info

Publication number
CN109165294B
CN109165294B (application CN201810951636.2A)
Authority
CN
China
Prior art keywords
classification
short text
data
text
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810951636.2A
Other languages
Chinese (zh)
Other versions
CN109165294A (en)
Inventor
水新莹
张宇光
黄亚坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xunfei Intelligent Technology Co ltd
Original Assignee
Anhui Xunfei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xunfei Intelligent Technology Co ltd filed Critical Anhui Xunfei Intelligent Technology Co ltd
Priority to CN201810951636.2A priority Critical patent/CN109165294B/en
Publication of CN109165294A publication Critical patent/CN109165294A/en
Application granted granted Critical
Publication of CN109165294B publication Critical patent/CN109165294B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a short text classification method based on Bayesian classification, which relates to the fields of smart cities and e-government and comprises the following steps: (1) data preprocessing and category labeling; (2) word segmentation and incremental feature vector extraction of the short text data; (3) establishing a Bayes-based short text classification model; (4) dividing the processed data set into a training set and a testing set, training the classification model, and optimizing the model according to the training results; (5) inputting short text data of unknown class into the trained model, outputting the probability that the input text belongs to each class, and selecting the class with the highest probability as the final classification result. The method can classify short text content effectively, intelligently, and automatically.

Description

Short text classification method based on Bayesian classification
Technical Field
The invention relates to the field of smart cities and electronic government affairs, in particular to a short text classification method based on Bayesian classification.
Background art:
with the development of the mobile internet and social networks and the rise of social software such as microblogs and WeChat, companies and government departments increasingly use such software to establish connections and communicate. Mobile social media are characterized by high publishing frequency and short text content, and the volume of short texts is growing rapidly. Short texts are likewise a research focus in search engines, intelligent customer service, and public opinion monitoring. Facing such a huge and growing number of netizens, extracting useful information from fragmentary text such as incident descriptions, private messages, and comments is very important for decision makers such as media outlets and governments. Manually classifying and extracting information from such large-scale short texts is inefficient and usually cannot complete the task effectively, so classifying short text content efficiently, intelligently, and automatically is of great significance for advancing the construction of e-government.
Existing text classification techniques mainly design the core classification algorithm around measures such as the representativeness (i.e., the weight) of keywords. For example, the existing document "a text classification method based on cluster word embedding" applies the k-means algorithm to the word vectors of a document to obtain a set of fixed-size clusters; the centroid of each cluster is interpreted as a hyperword embedding, and each embedded word in the text collection is assigned to its nearest cluster center. Each text is then represented as a bag of hyperword embeddings, and the frequency of each hyperword embedding in the text yields the text's category.
Analyzing such short text classification methods: the selection of keywords affects the classification result, and both the number of keywords and their generality must be considered. In short text classification the feature keywords are few, and in the actual classification process they can hardly express the intrinsic meaning of the short text, which easily yields several candidate categories for a single text. In addition, semantic information in short texts also affects the classification result. While the prior-art method of extracting feature keywords works well for classifying long texts, it struggles to classify short texts effectively.
For example, CN201710216502.1 discloses a method for obtaining a text classifier with automatically labeled corpora, and a text classifier. The method includes determining a concept set, then matching and automatically labeling unlabeled corpus texts against the concept keywords in the keyword set of each concept. For each concept, once the number of texts in its labeled corpus set meets a threshold condition, a corresponding text classification model is trained to obtain a text classifier, finally yielding the set of text classifiers for all concepts that meet the threshold. The algorithm structure is general, can flexibly change the classification system, and saves computation time and resources; only a small amount of initial corpus text is needed, with automatic rather than manual labeling, further saving time and cost. However, this classification method does not disclose a technical scheme for improving its accuracy through autonomous training.
For example, CN201710882685.0 discloses a method and an apparatus for establishing a text classification model and for text classification. The establishing method includes: obtaining training samples; segmenting each text into words based on an entity dictionary to obtain a corresponding vector matrix; and training a first and a second classification model with the texts' vector matrices and categories. During training, the loss function of the text classification model is obtained from the loss functions of the first and second classification models and is used to adjust their parameters, yielding a text classification model composed of the two. The text classification method includes: acquiring a text to be classified; segmenting it based on the entity dictionary to obtain its vector matrix; and inputting the vector matrix into the text classification model, whose output gives the classification result. However, this method likewise does not disclose how to improve classification accuracy through autonomous training.
Disclosure of Invention
The invention aims to provide a short text classification method based on Bayesian classification to solve the defects in the prior art.
A short text classification method based on Bayesian classification is characterized by comprising the following steps:
(1) data preprocessing and category labeling:
step one: extract the reported historical short text data and perform conventional data cleaning and data integration on it to improve data quality;
step two: complete the category labeling of the preliminarily cleaned data using the historically processed short texts, and manually label the categories of the currently unprocessed portion, completing the data preprocessing;
(2) completing word segmentation and incremental feature vector extraction of short text data, comprising the following two core steps:
step one: segment the cleaned short text content using the Python third-party library Jieba;
step two: extract the incremental feature vector and extract keywords in combination with TF-IDF; if the number of keywords is too small, use all segmented phrases directly as the final classification parameter input;
(3) establishing a short text classification model based on Bayes;
(4) dividing the processed data set into a training set and a testing set, carrying out classification model training, and carrying out model optimization according to the result of the training set;
(5) according to the trained model, inputting short text data of unknown classes, outputting the probability that the current input text belongs to each class, and selecting the class with the maximum probability as a result of final classification of the classes.
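Step (4)'s hold-out split can be sketched in Python. The 80/20 ratio and the fixed seed are illustrative assumptions only; the patent does not specify a split ratio:

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle labelled samples and hold out a test portion, as in step (4)."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = [(f"short text {i}", i % 3) for i in range(10)]  # (text, class label) pairs
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

Model optimization then proceeds by retraining on the training portion and checking accuracy on the held-out portion until the classification test precision stabilizes.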
Preferably, the data preprocessing comprises the following four steps:
step one: clean and organize the raw data, splitting each text record into three fields: major serial number, minor serial number, and body text;
step two: store the processed data in a database;
step three: segment the content of the third field, the plain text, using Jieba;
step four: retain three words per row of the segmented output according to part of speech, and store them in the database.
Preferably, extracting feature keywords via the incremental feature vector and the TF-IDF feature word extraction method comprises the following two steps:
step one: let B = (B1, B2, ..., Bu) be the feature vector composed of feature words extracted from the text; words that describe an existing feature word of the vector are merged and named as a new feature word B(u+1), and so on, so that for u = 5, 6, ..., m the incremental feature vector B = (B1, B2, ..., Bm) is obtained;
step two: if a word or phrase has a high term frequency (TF) in one article but rarely occurs in other articles, it is considered to have good discriminating power and to be suitable for classification. The TF-IDF feature extraction function is F(w) = TF(w) × IDF(w). Feature keyword extraction on the short text content follows this formula: first, the term frequency of feature word w is recorded as TF(w), and the term frequency TF is used together with the inverse document frequency IDF; then IDF(w) = log[N/(n(w)+1)] is computed, where N is the total number of texts and n(w) is the number of texts containing w.
Preferably, for an input short text sample record, B = (B1, B2, ..., Bm) is the extracted feature vector and C1, C2, ..., Cn are the n classification results; P(Ci|B), i = 1, 2, ..., n, denotes the probability that the text to be classified belongs to the ith classification result; P(Bj|Ci), j = 1, 2, ..., m, i = 1, 2, ..., n, denotes the probability that the jth feature word belongs to the ith class. In the concrete calculation, the Bayesian formula gives:
P(Ci|B) = P(B|Ci) · P(Ci) / P(B)
When classifying a new text, only the P(Ci|B) of the n classes need to be calculated, and the new sample is assigned to the class with the highest probability value. Since P(B) is a constant independent of the class, and given the independence among the feature words of the feature vector B = (B1, B2, ..., Bm), the above formula simplifies to:
P(Ci|B) ∝ P(Ci) · P(B1|Ci) · P(B2|Ci) · ... · P(Bm|Ci)
Preferably, the category of unknown short text information is calculated from the established model: let N be the total number of samples and Cou(Ci) the count of the ith class among them, so that P(Ci) = Cou(Ci)/N; let Cou(Bij) be the count of the jth feature word within the ith class, so that P(Bj|Ci) = Cou(Bij)/Cou(Ci). Finally, the probability of each class is calculated for the sample to be classified, and the class with the maximum probability is taken:
result class = argmax over i = 1, ..., n of P(Ci) · P(B1|Ci) · P(B2|Ci) · ... · P(Bm|Ci)
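The count-based estimates P(Ci) = Cou(Ci)/N and P(Bj|Ci) = Cou(Bij)/Cou(Ci), followed by choosing the class with the maximum product, can be sketched in Python. The add-one (Laplace) smoothing is my own addition, to keep an unseen feature word from zeroing out a whole product; the patent does not specify any smoothing:

```python
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (tokens, label). Returns class priors P(Ci) and
    per-class word counters for the likelihoods P(Bj|Ci)."""
    class_count = Counter()
    word_count = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        class_count[label] += 1
        word_count[label].update(tokens)
        vocab.update(tokens)
    N = len(samples)
    priors = {c: class_count[c] / N for c in class_count}  # P(Ci) = Cou(Ci)/N
    return priors, word_count, vocab

def predict(tokens, priors, word_count, vocab):
    """Assign the class maximizing P(Ci) * prod_j P(Bj|Ci)."""
    best, best_p = None, -1.0
    V = len(vocab)
    for c, prior in priors.items():
        total = sum(word_count[c].values())
        p = prior
        for w in tokens:
            # add-one smoothing (an assumption; not stated in the patent)
            p *= (word_count[c][w] + 1) / (total + V)
        if p > best_p:
            best, best_p = c, p
    return best

labelled = [(["road", "light", "broken"], "municipal"),
            (["street", "lamp", "broken"], "municipal"),
            (["noise", "loud", "night"], "environment")]
priors, wc, vocab = train_nb(labelled)
print(predict(["lamp", "broken"], priors, wc, vocab))  # municipal
```

For longer feature vectors, summing log-probabilities instead of multiplying raw probabilities is the usual way to avoid numerical underflow.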
The invention has the advantages that: the short text classification method based on Bayesian classification analyzes and classifies the short text content reported by users and distributes it to business units. In the core short text classification process, the source data first undergo data cleaning, regularized integration, and similar processing; part of the short text data is extracted as training data and labeled with classes according to the classification requirements. Then the cleaned short text content is segmented with the Python third-party library Jieba and keywords are extracted with TF-IDF. Because short text content is sparse, the keywords extracted by TF-IDF serve as a reference before Bayesian classification modeling; if too few keywords are extracted, the phrases obtained from segmenting the short text are used directly for classification modeling. A classification model is then built on the Bayesian formula according to the above steps, and the model is adjusted until the classification test precision is stable.
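The keyword fallback described above — use the TF-IDF keywords when enough were extracted, otherwise all segmented phrases — can be sketched as follows; the threshold of 3 is an illustrative value, since the patent only says "too few":

```python
def select_features(tokens, tfidf_keywords, min_keywords=3):
    """Use the TF-IDF keywords as the feature phrase set if there are enough
    of them; otherwise fall back to all segmented phrases (the patent's rule).
    min_keywords=3 is an assumed threshold, not taken from the patent."""
    if len(tfidf_keywords) >= min_keywords:
        return list(tfidf_keywords)
    return list(tokens)

print(select_features(["a", "b", "c", "d"], ["a", "b"]))       # too few: all phrases
print(select_features(["a", "b", "c", "d"], ["a", "b", "c"]))  # enough: keywords only
```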
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flow chart of data processing in the present invention.
Detailed Description
In order to make the technical means, creative features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
As shown in fig. 1 and fig. 2, a short text classification method based on Bayesian classification comprises the following steps:
(1) data preprocessing and category labeling:
step one: extract the reported historical short text data and process it with conventional data cleaning, data integration, and the like to improve data quality;
step two: complete the category labeling of the preliminarily cleaned data using the historically processed short texts, and manually label the categories of the currently unprocessed portion, completing the data preprocessing;
(2) completing word segmentation and incremental feature vector extraction of short text data, and mainly comprising the following two core steps:
step one: segment the cleaned short text content using the Python third-party library Jieba;
step two: extract the incremental feature vector and extract keywords in combination with TF-IDF; if the number of keywords is too small, use all segmented phrases directly as the final classification parameter input;
(3) establishing a short text classification model based on Bayes;
(4) dividing the processed data set into a training set and a testing set, carrying out classification model training, and carrying out model optimization according to the result of the training set;
(5) according to the trained model, inputting short text data of unknown classes, outputting the probability that the current input text belongs to each class, and selecting the class with the maximum probability as a result of final classification of the classes.
It is noted that the data preprocessing comprises the following four steps:
step one: clean and organize the raw data, splitting each text record into three fields: major serial number, minor serial number, and body text;
step two: store the processed data in a database;
step three: segment the content of the third field, the plain text, using Jieba;
step four: retain three words per row of the segmented output according to part of speech, and store them in the database.
In this embodiment, extracting feature keywords via the incremental feature vector and the TF-IDF feature word extraction method includes the following two steps:
step one: let B = (B1, B2, ..., Bu) be the feature vector composed of feature words extracted from the text; words that describe an existing feature word of the vector are merged and named as a new feature word B(u+1), and so on, so that for u = 5, 6, ..., m the incremental feature vector B = (B1, B2, ..., Bm) is obtained;
step two: if a word or phrase has a high term frequency (TF) in one article but rarely occurs in other articles, it is considered to have good discriminating power and to be suitable for classification. The TF-IDF feature extraction function is F(w) = TF(w) × IDF(w). Feature keyword extraction on the short text content follows this formula: first, the term frequency of feature word w is recorded as TF(w), and the term frequency TF is used together with the inverse document frequency IDF; then IDF(w) = log[N/(n(w)+1)] is computed, where N is the total number of texts and n(w) is the number of texts containing w.
In the present embodiment, for an input short text sample record, B = (B1, B2, ..., Bm) is the extracted feature vector and C1, C2, ..., Cn are the n classification results; P(Ci|B), i = 1, 2, ..., n, denotes the probability that the text to be classified belongs to the ith classification result; P(Bj|Ci), j = 1, 2, ..., m, i = 1, 2, ..., n, denotes the probability that the jth feature word belongs to the ith class. In the concrete calculation, the Bayesian formula gives:
P(Ci|B) = P(B|Ci) · P(Ci) / P(B)
When classifying a new text, only the P(Ci|B) of the n classes need to be calculated, and the new sample is assigned to the class with the highest probability value. Since P(B) is a constant independent of the class, and given the independence among the feature words of the feature vector B = (B1, B2, ..., Bm), the above formula simplifies to:
P(Ci|B) ∝ P(Ci) · P(B1|Ci) · P(B2|Ci) · ... · P(Bm|Ci)
In addition, the category of unknown short text information is calculated from the established model: let N be the total number of samples and Cou(Ci) the count of the ith class among them, so that P(Ci) = Cou(Ci)/N; let Cou(Bij) be the count of the jth feature word within the ith class, so that P(Bj|Ci) = Cou(Bij)/Cou(Ci). Finally, the probability of each class is calculated for the sample to be classified, and the class with the maximum probability is taken:
result class = argmax over i = 1, ..., n of P(Ci) · P(B1|Ci) · P(B2|Ci) · ... · P(Bm|Ci)
Based on the foregoing, the short text classification method based on Bayesian classification comprises the following steps: (1) data preprocessing and category labeling; (2) word segmentation and incremental feature vector extraction of the short text data; (3) establishing a Bayes-based short text classification model; (4) dividing the processed data set into a training set and a testing set, training the classification model, and optimizing the model according to the training results; (5) inputting short text data of unknown class into the trained model, outputting the probability that the input text belongs to each class, and selecting the class with the highest probability as the final classification result. The short text content reported by users is thus analyzed, classified, and distributed to business units. In the core short text classification process, the source data first undergo cleaning, regularized integration, and similar processing; part of the short text data is extracted as training data and labeled according to the classification requirements. The cleaned short text content is then segmented with the Python third-party library Jieba and keywords are extracted with TF-IDF; because short text content is sparse, the extracted keywords serve as a reference before Bayesian classification modeling, and if too few keywords are extracted, the segmented phrases are used directly for modeling. Finally, a classification model is built on the Bayesian formula according to the above steps, and the model is adjusted until the classification test precision is stable.
It will be appreciated by those skilled in the art that the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of or equivalence to the invention are intended to be embraced therein.

Claims (3)

1. A short text classification method based on Bayesian classification is characterized by comprising the following steps:
(1) data preprocessing and category labeling:
step one: extract the reported historical short text data and perform conventional data cleaning and data integration on it to improve data quality;
step two: complete the category labeling of the preliminarily cleaned data using the historically processed short texts, and manually label the categories of the currently unprocessed portion, completing the data preprocessing;
(2) completing word segmentation and incremental feature vector extraction of short text data, comprising the following two core steps:
step one: segment the cleaned short text content using the Python third-party library Jieba;
step two: extract the incremental feature vector and extract keywords in combination with TF-IDF; if the number of keywords is too small, use all segmented phrases directly as the final classification parameter input;
(3) establishing a short text classification model based on Bayes;
(4) dividing the processed data set into a training set and a testing set, carrying out classification model training, and carrying out model optimization according to the result of the training set;
(5) inputting short text data of unknown classes according to the trained model, outputting the probability that the current input text belongs to each class, and selecting the class with the maximum probability as a result of final classification of the classes;
the data preprocessing comprises the following four steps:
step one: clean and organize the raw data, splitting each text record into three fields: major serial number, minor serial number, and body text;
step two: store the processed data in a database;
step three: segment the content of the third field, the plain text, using Jieba;
step four: retain three words per row of the segmented output according to part of speech, and store them in the database;
extracting feature keywords via the incremental feature vector and the TF-IDF feature word extraction method comprises the following two steps:
step one: let B = (B1, B2, ..., Bu) be the feature vector composed of feature words extracted from the text; words that describe an existing feature word of the vector are merged and named as a new feature word B(u+1), and so on, so that for u = 5, 6, ..., m the incremental feature vector B = (B1, B2, ..., Bm) is obtained;
step two: if a word or phrase has a high term frequency (TF) in one article but rarely occurs in other articles, it is considered to have good discriminating power and to be suitable for classification. The TF-IDF feature extraction function is F(w) = TF(w) × IDF(w). Feature keyword extraction on the short text content follows this formula: first, the term frequency of feature word w is recorded as TF(w), and the term frequency TF is used together with the inverse document frequency IDF; then IDF(w) = log[N/(n(w)+1)] is computed, where N is the total number of texts and n(w) is the number of texts containing w.
2. The Bayesian classification-based short text classification method according to claim 1, wherein: for an input short text sample record, B = (B1, B2, ..., Bm) is the extracted feature vector and C1, C2, ..., Cn are the n classification results; P(Ci|B), i = 1, 2, ..., n, denotes the probability that the text to be classified belongs to the ith classification result; P(Bj|Ci), j = 1, 2, ..., m, i = 1, 2, ..., n, denotes the probability that the jth feature word belongs to the ith class. In the concrete calculation, the Bayesian formula gives:
P(Ci|B) = P(B|Ci) · P(Ci) / P(B)
When classifying a new text, only the P(Ci|B) of the n classes need to be calculated, and the new sample is assigned to the class with the highest probability value. Since P(B) is a constant independent of the class, and given the independence among the feature words of the feature vector B = (B1, B2, ..., Bm), the above formula simplifies to:
P(Ci|B) ∝ P(Ci) · P(B1|Ci) · P(B2|Ci) · ... · P(Bm|Ci)
3. The Bayesian classification-based short text classification method according to claim 1, wherein: the category of unknown short text information is calculated from the established model: let N be the total number of samples and Cou(Ci) the count of the ith class among them, so that P(Ci) = Cou(Ci)/N; let Cou(Bij) be the count of the jth feature word within the ith class, so that P(Bj|Ci) = Cou(Bij)/Cou(Ci); finally, the probability of each class is calculated for the sample to be classified, and the class with the maximum probability is taken:
result class = argmax over i = 1, ..., n of P(Ci) · P(B1|Ci) · P(B2|Ci) · ... · P(Bm|Ci)
CN201810951636.2A 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification Active CN109165294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810951636.2A CN109165294B (en) 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810951636.2A CN109165294B (en) 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification

Publications (2)

Publication Number Publication Date
CN109165294A CN109165294A (en) 2019-01-08
CN109165294B true CN109165294B (en) 2021-09-24

Family

ID=64896189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810951636.2A Active CN109165294B (en) 2018-08-21 2018-08-21 Short text classification method based on Bayesian classification

Country Status (1)

Country Link
CN (1) CN109165294B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256865B (en) * 2019-01-31 2023-03-21 青岛科技大学 Chinese text classification method based on classifier
CN110287316A (en) * 2019-06-04 2019-09-27 深圳前海微众银行股份有限公司 A kind of Alarm Classification method, apparatus, electronic equipment and storage medium
CN110619363A (en) * 2019-09-17 2019-12-27 陕西优百信息技术有限公司 Classification method for subclass names corresponding to long description of material data
CN111159414B (en) * 2020-04-02 2020-07-14 成都数联铭品科技有限公司 Text classification method and system, electronic equipment and computer readable storage medium
CN111488459B (en) * 2020-04-15 2022-07-22 焦点科技股份有限公司 Product classification method based on keywords
CN111985222B (en) * 2020-08-24 2023-07-18 平安国际智慧城市科技股份有限公司 Text keyword recognition method and related equipment
CN112084308A (en) * 2020-09-16 2020-12-15 中国信息通信研究院 Method, system and storage medium for text type data recognition
CN112214598B (en) * 2020-09-27 2023-01-13 吾征智能技术(北京)有限公司 Cognitive system based on hair condition
CN112559748A (en) * 2020-12-18 2021-03-26 厦门市法度信息科技有限公司 Method for classifying stroke record data records, terminal equipment and storage medium
CN112883159A (en) * 2021-02-25 2021-06-01 北京精准沟通传媒科技股份有限公司 Method, medium, and electronic device for generating hierarchical category label for domain evaluation short text
CN113869356A (en) * 2021-08-17 2021-12-31 杭州华亭科技有限公司 Method for judging escape tendency of people based on Bayesian classification
CN114528404A (en) * 2022-02-18 2022-05-24 浪潮卓数大数据产业发展有限公司 Method and device for identifying provincial and urban areas
CN116956930A (en) * 2023-09-20 2023-10-27 北京九栖科技有限责任公司 Short text information extraction method and system integrating rules and learning models

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725732B1 (en) * 2009-03-13 2014-05-13 Google Inc. Classifying text into hierarchical categories
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
WO2016090197A1 (en) * 2014-12-05 2016-06-09 Lightning Source Inc. Automated content classification/filtering
CN106407482A (en) * 2016-12-01 2017-02-15 合肥工业大学 Multi-feature fusion-based online academic report classification method
CN107066553A (en) * 2017-03-24 2017-08-18 北京工业大学 Short text classification method based on convolutional neural networks and random forest

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Chinese Short Text Classification Based on Wikipedia; Fan Yunjie, Liu Huailiang; New Technology of Library and Information Service (《现代图书情报技术》); 2012-12-31; full text *

Also Published As

Publication number Publication date
CN109165294A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
CN109165294B (en) Short text classification method based on Bayesian classification
CN109933670B (en) Text classification method for calculating semantic distance based on combined matrix
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN112163424A (en) Data labeling method, device, equipment and medium
CN108596637B (en) Automatic E-commerce service problem discovery system
CN111191442A (en) Similar problem generation method, device, equipment and medium
CN111462752A (en) Client intention identification method based on attention mechanism, feature embedding and BiLSTM
TWI828928B (en) Highly scalable, multi-label text classification methods and devices
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN111966888B (en) Aspect class-based interpretability recommendation method and system for fusing external data
CN115409018A (en) Company public opinion monitoring system and method based on big data
CN113360647B (en) 5G mobile service complaint source-tracing analysis method based on clustering
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN111754208A (en) Automatic screening method for recruitment resumes
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN111178080A (en) Named entity identification method and system based on structured information
CN114722198A (en) Method, system and related device for determining product classification code
CN114239579A (en) Electric power searchable document extraction method and device based on regular expression and CRF model
CN109871889B (en) Public psychological assessment method under emergency
CN111859955A (en) Public opinion data analysis model based on deep learning
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN115033689A (en) Original network Euclidean distance calculation method based on small sample text classification
CN113239277A (en) Probability matrix decomposition recommendation method based on user comments
CN115080732A (en) Complaint work order processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 241000 room 01, 18 / F, iFLYTEK intelligent building, No. 9, Wenjin West Road, Yijiang District, Wuhu City, Anhui Province

Patentee after: ANHUI XUNFEI INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: 241000 Floor 9, block A1, Wanjiang Fortune Plaza, Jiujiang District, Wuhu City, Anhui Province

Patentee before: ANHUI XUNFEI INTELLIGENT TECHNOLOGY Co.,Ltd.
