CN109344252B - Microblog text classification method and system based on high-quality theme extension - Google Patents


Info

Publication number
CN109344252B
Authority
CN
China
Prior art keywords: theme, text, words, quality, word
Prior art date
Legal status
Active
Application number
CN201811064231.3A
Other languages
Chinese (zh)
Other versions
CN109344252A (en)
Inventor
张曦元
孙福权
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811064231.3A
Publication of CN109344252A
Application granted
Publication of CN109344252B
Status: Active


Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention provides a microblog text classification method and system based on high-quality topic expansion. The method expands the features of short texts such as microblogs and enables their effective classification. Training-set microblog data are used as input to an LDA model to obtain topic and topic-word probability distributions; highly representative topics are extracted with information entropy and, together with inter-topic similarity, used to select high-quality topics; topic inference is performed on the test-set microblogs; feature words of the high-quality topics are selected to expand the microblog texts; and the expanded texts are classified with a support vector machine. The method addresses the inaccurate text feature expansion caused by topic mixing when a topic model is used to expand microblog text features.

Description

Microblog text classification method and system based on high-quality theme extension
Technical Field
The invention relates to the technical field of text classification, in particular to a microblog text classification method and system based on high-quality theme extension.
Background
As an emerging medium, the microblog now has hundreds of millions of users and holds a leading position among Chinese social network platforms. Microblogs are simple to operate and their content is updated rapidly, which gives them high research value. Text classification has been studied extensively over the past decades, but results on short texts such as microblogs have remained unsatisfactory. Because microblog texts are short and their features sparse, word segmentation and stop-word removal filter out many words, and few features remain after feature selection; the computational complexity drops, but classification accuracy drops markedly as well. The features of microblog texts therefore need to be expanded for better classification.
The LDA model is a three-layer Bayesian probability model over words, topics, and documents. It assumes each document is composed of several latent topics, mines those topics from the co-occurrence of words, represents each text as a probability distribution over topics, and represents each topic as a probability distribution over words. Expanding short-text features with topic distributions is an effective way to improve short-text classification, but not every topic trained by a topic model expresses a single coherent theme; topic mixing and topic ambiguity occur, and expanding short texts directly with such topics introduces inconsistent features.
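As a minimal illustration of the generative view described above (document to topic to word), the following pure-Python sketch draws each word of a document by first sampling a topic from the document's topic distribution θ and then sampling a word from that topic's word distribution φ. It is a toy with made-up distributions, not part of the patented method:

```python
import random

def generate_document(theta, phi, vocab, length, seed=0):
    """Generate one document under the LDA generative story:
    for each word slot, sample a topic z ~ theta, then a word w ~ phi[z]."""
    rng = random.Random(seed)
    doc = []
    for _ in range(length):
        # pick a topic index according to the document's topic proportions
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # pick a word index according to that topic's word distribution
        w = rng.choices(range(len(vocab)), weights=phi[z])[0]
        doc.append(vocab[w])
    return doc
```

Training inverts this story: given only the documents, LDA recovers θ and φ from word co-occurrence.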
Disclosure of Invention
In view of the technical problems in expanding microblog texts with a topic model, a microblog text classification method and system based on high-quality topic expansion are provided. The method extracts high-quality topics effectively, and after it is used for microblog feature expansion it overcomes the poor classification caused by feature sparsity.
The technical solution adopted by the invention is as follows:
A microblog text classification method based on high-quality topic expansion comprises the following steps:
S1, preprocessing the microblog texts and selecting features, and constructing a training set and a test set from the preprocessed texts;
S2, using the preprocessed training-set data as input to an LDA model to obtain the topic probability distribution and topic-word probability distribution of the training-set data;
S3, applying information entropy to the topic-word probability distributions to calculate topic entropy, and calculating the relative entropy and average similarity of topics, thereby computing a quality coefficient for each topic and setting a threshold to screen out the high-quality topics;
S4, performing topic division on the training set and the test set respectively: through the LDA topic distributions, assigning each text to the topic with the maximum probability among the high-quality topics, and adding that topic's subject words as expansion words to the text features of the training set and the test set respectively;
S5, representing the expanded texts with a vector space model, calculating the weight of each feature word with TF-IDF, converting the training and test documents into vectors, selecting useful features, training an SVM classifier on the training set, and performing classification prediction on the test set to generate the classification result.
Further, preprocessing the microblog texts and selecting features comprises the following steps:
S11, performing Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus;
S12, removing stop words such as common conjunctions and pronouns from the segmented texts with a Chinese stop-word list, deleting any feature word that appears in the stop-word list, and then removing punctuation marks;
S13, dividing the preprocessed texts by category to build a dictionary, counting the word statistics of each category, sorting the feature words by total number of occurrences in descending order, selecting the top n words of each category as its feature words, and aggregating them as the general features of the category.
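A minimal sketch of steps S12 and S13, assuming the texts are already segmented into token lists and a stop-word set is given (the function names and the `isalnum` punctuation test are illustrative choices, not from the patent):

```python
from collections import Counter

def preprocess(tokens, stopwords):
    """Step S12: drop stop words and punctuation tokens from a segmented text."""
    return [t for t in tokens if t not in stopwords and t.isalnum()]

def category_features(texts_by_category, n):
    """Step S13: per category, count word occurrences, sort in descending
    order of total occurrences, and keep the top-n words as that category's
    feature set."""
    features = {}
    for cat, texts in texts_by_category.items():
        counts = Counter(w for text in texts for w in text)
        features[cat] = [w for w, _ in counts.most_common(n)]
    return features
```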
Further, in step S2, the topic probability distribution of the training-set data is obtained through the following steps:
S21, setting the topic model parameter α and the topic number K, and drawing the topic distribution (doc-topic matrix) θ_m of each microblog from the Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m is the topic probability distribution of document m, estimated as

θ_{m,k} = (n_{m,k} + α) / (∑_{k'=1}^{K} n_{m,k'} + K·α)

where n_{m,k} is the number of words assigned to topic k in the m-th microblog.
Further, in step S2, the topic-word probability distribution of the training-set data is obtained through the following steps:
S22, setting the topic model parameter β and the topic number K, and drawing the word distribution (topic-word matrix) φ_k of each topic from the Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k is the word probability distribution of topic k, estimated as

φ_{k,v} = (n_{k,v} + β) / (∑_{v'=1}^{V} n_{k,v'} + V·β)

where n_{k,v} is the number of occurrences of word v under topic k and V is the vocabulary size.
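The Dirichlet-smoothed count estimates of steps S21 and S22 can be sketched as follows, assuming the Gibbs-sampling counts n_{m,k} and n_{k,v} are already available (a minimal illustration, not the patent's implementation):

```python
def doc_topic_dist(n_mk, alpha):
    """theta_{m,k} = (n_{m,k} + alpha) / (sum_k' n_{m,k'} + K*alpha):
    the smoothed topic proportions of one document (step S21)."""
    K = len(n_mk)
    total = sum(n_mk) + K * alpha
    return [(n + alpha) / total for n in n_mk]

def topic_word_dist(n_kv, beta):
    """phi_{k,v} = (n_{k,v} + beta) / (sum_v' n_{k,v'} + V*beta):
    the smoothed word probabilities of one topic (step S22)."""
    V = len(n_kv)
    total = sum(n_kv) + V * beta
    return [(n + beta) / total for n in n_kv]
```

Both estimates add the Dirichlet prior as pseudo-counts, so every topic and every word keeps nonzero probability.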
Further, the step S3 specifically includes:
S31, calculating the topic information entropy TE:

TE(k) = -∑_w P(w|k) · ln P(w|k)

where P(w|k) is the probability of word w appearing under topic k;
S32, calculating the relative entropy between topics:

KL(P‖Q) = ∑_x P(x) · ln( P(x) / Q(x) )

where P and Q are the distributions being compared; the relative entropy is zero when the two distributions are identical and increases as the difference between them increases;
S33, calculating the average similarity of topics. The JS distance, built from the relative entropy, measures the similarity between topics:

JS(P, Q) = (1/2)·KL(P ‖ (P+Q)/2) + (1/2)·KL(Q ‖ (P+Q)/2)

The average similarity measures the independence of a distribution relative to the other distributions:

AS(k) = (1/(K-1)) · ∑_{j≠k} JS(φ_k, φ_j)

where K is the total number of topics;
S34, screening the high-quality topics.
A topic quality coefficient G(k) is calculated from the topic entropy TE(k) and the average similarity AS(k) (the formula appears only as an equation image in the original). If the coefficient satisfies G(k) > μ, where μ is the threshold, the topic is judged high-quality and kept as an expansion candidate; otherwise it is not, yielding the high-quality topic set S.
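A sketch of the step S3 quantities follows. The entropy, relative-entropy, and JS computations mirror the formulas above; the quality coefficient `quality` is only an assumed stand-in, since the patent gives G(k) as an equation image — here G grows as the topic entropy falls and as the average JS distance to the other topics rises:

```python
import math

def topic_entropy(phi_k):
    """TE(k) = -sum_w P(w|k) * ln P(w|k) (step S31); a small value means a
    few words dominate, i.e. the topic is sharply represented."""
    return -sum(p * math.log(p) for p in phi_k if p > 0)

def kl(p, q):
    """Relative entropy KL(P||Q) (step S32)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS distance built from the relative entropy to the mixture (step S33)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_similarity(k, phis):
    """AS(k): average JS distance of topic k to every other topic."""
    K = len(phis)
    return sum(js(phis[k], phis[j]) for j in range(K) if j != k) / (K - 1)

def quality(k, phis):
    """Assumed stand-in for the quality coefficient G(k) of step S34:
    rewards low entropy (sharp topics) and high independence."""
    return avg_similarity(k, phis) / (topic_entropy(phis[k]) + 1e-12)
```

Topics whose coefficient exceeds the threshold μ form the high-quality set S.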
Further, in step S4, the topic division performed on the training set specifically comprises:
S41, for each microblog, according to the topic distribution given by the topic model trained on the training set, selecting the topic with the maximum probability among the high-quality topics, taking from the corresponding topic words the λ feature words with the highest probabilities, w = {w_1, w_2, …, w_λ}, and adding the words w to the training text's features as expansion words; an expansion word not already present in the original document is merged into the document.
Further, in step S4, the topic inference and feature expansion performed on the test set specifically comprise:
S42, performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability within the high-quality topic set S, taking its λ highest-probability feature words w = {w_1, w_2, …, w_λ}, and adding these words to the test text's features as expansion words.
Further, the step S5 is specifically:
S51, representing the expanded text obtained in step S41 with a vector space model: document d is treated as an n-dimensional vector d = (ε_1, ε_2, …, ε_n), where ε_i is the weight of the i-th word, calculated with TF-IDF:

ε_i = tf_{ij} · log( M / df_i )

where tf_{ij} is the frequency of the feature word in the text, df_i is the number of texts in the corpus that contain the feature word, and M is the total number of texts in the corpus;
S52, classifying the texts with the LIBSVM tool; each document is converted into the data format label 1:value 2:value …, where label is the category identifier, the integers 1, 2, … are feature indices, and each value is the corresponding TF-IDF weight;
S53, recording the training-set category labels Y = {y1, y2, …, yn}, training the model on the training set, and then performing classification prediction on the test set.
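Steps S51 and S52 can be sketched as follows; the natural-log base in the TF-IDF weight and the four-decimal LIBSVM formatting are assumptions, since the patent does not fix them:

```python
import math

def tfidf(tf_ij, df_i, M):
    """epsilon_i = tf_ij * log(M / df_i): weight of a feature word that
    occurs tf_ij times in the text and in df_i of the M corpus texts
    (step S51)."""
    return tf_ij * math.log(M / df_i)

def to_libsvm_line(label, weights):
    """Step S52: one LIBSVM input line `label index:value ...`, where the
    integers are 1-based feature indices and the values are TF-IDF
    weights; zero weights are omitted, as LIBSVM expects sparse input."""
    pairs = " ".join(f"{i + 1}:{w:.4f}" for i, w in enumerate(weights) if w != 0)
    return f"{label} {pairs}"
```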
The invention also provides a microblog text classification system based on high-quality topic expansion, comprising:
a text acquisition unit for acquiring collected microblog text data and constructing a training set and a test set;
a text data preprocessing unit for preprocessing the original text samples and selecting features, comprising:
a Chinese word segmentation module for splitting complete sentences into words and removing stop words from the text,
a Chinese stop-word list module for deleting feature words that appear in the stop-word list and removing punctuation marks,
a dictionary building module for sorting and aggregating the feature words in the text;
an LDA model training unit for obtaining the document-topic and topic-word distributions from the training-set data, comprising:
a data processing module for computing the quality coefficient from the topic-word distribution data and selecting the high-quality topics with a set threshold;
the LDA model training unit is further used to expand the training-set and test-set texts with the high-quality feature words;
and a text classification unit for classifying the expanded training set with the LIBSVM tool and classifying the test-set data to generate the classification result.
Compared with the prior art, the invention has the following advantages:
The microblog text classification method based on high-quality topic expansion extracts high-quality topics effectively, and after microblog feature expansion it overcomes the poor classification caused by sparse features. Compared with the prior art, it achieves higher accuracy, is better suited to feature expansion in short-text classification such as microblogs, and classifies microblogs effectively. It resolves the inaccurate text feature expansion caused by topic mixing when a topic model is used to expand microblog text features.
For these reasons, the method can be widely applied in the technical field of text classification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a microblog text classification method based on high-quality topic expansion.
FIG. 2 is an LDA probability model of the microblog text classification method based on the high-quality topic expansion.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1, the invention provides a microblog text classification method based on high-quality topic expansion, which comprises the following steps:
S1, preprocessing the microblog texts and selecting features, and constructing a training set and a test set from the preprocessed texts;
S2, using the preprocessed training-set data as input to an LDA model to obtain the topic probability distribution and topic-word probability distribution of the training-set data;
S3, applying information entropy to the topic-word probability distributions to calculate topic entropy, and calculating the relative entropy and average similarity of topics, thereby computing a quality coefficient for each topic and setting a threshold to screen out the high-quality topics;
S4, performing topic division on the training set and the test set respectively: through the LDA topic distributions, assigning each text to the topic with the maximum probability among the high-quality topics, and adding that topic's subject words as expansion words to the text features of the training set and the test set respectively;
S5, representing the expanded texts with a vector space model, calculating the weight of each feature word with TF-IDF, converting the training and test documents into vectors, selecting useful features, training an SVM classifier on the training set, and performing classification prediction on the test set to generate the classification result.
The preprocessing and feature selection of the microblog texts comprise the following steps:
S11, performing Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus;
S12, removing stop words such as common conjunctions and pronouns from the segmented texts with a Chinese stop-word list, deleting any feature word that appears in the stop-word list, and then removing punctuation marks;
S13, dividing the preprocessed texts by category to build a dictionary, counting the word statistics of each category, sorting the feature words by total number of occurrences in descending order, selecting the top n words of each category as its feature words, and aggregating them as the general features of the category.
As shown in fig. 2, the topic model parameters α and β and the topic number K are set, and parameter estimation is performed with Gibbs sampling.
S21, setting the topic model parameter α and the topic number K, and drawing the topic distribution (doc-topic matrix) θ_m of each microblog from the Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m is the topic probability distribution of document m, estimated as

θ_{m,k} = (n_{m,k} + α) / (∑_{k'=1}^{K} n_{m,k'} + K·α)

where n_{m,k} is the number of words assigned to topic k in the m-th microblog.
S22, setting the topic model parameter β and the topic number K, and drawing the word distribution (topic-word matrix) φ_k of each topic from the Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k is the word probability distribution of topic k, estimated as

φ_{k,v} = (n_{k,v} + β) / (∑_{v'=1}^{V} n_{k,v'} + V·β)

where n_{k,v} is the number of occurrences of word v under topic k and V is the vocabulary size.
S31, applying information entropy to the topic-word distributions to compute the topic entropy and thereby identify the high-quality topics. The topic information entropy TE is

TE(k) = -∑_w P(w|k) · ln P(w|k)

where P(w|k) is the probability of word w appearing under topic k. The smaller TE is, the more concentrated the distribution: a small number of feature words appear with high probability and the remaining words with low probability, so the topic is strongly representative and contains little topic noise.
S32, calculating the relative entropy of topics, an index that measures the difference between probability distributions:

KL(P‖Q) = ∑_x P(x) · ln( P(x) / Q(x) )

where P and Q are the distributions being compared; the relative entropy is zero when the two distributions are identical and increases as the difference between them increases;
S33, calculating the average similarity of topics. The JS distance, built from the relative entropy, measures the similarity between topics:

JS(P, Q) = (1/2)·KL(P ‖ (P+Q)/2) + (1/2)·KL(Q ‖ (P+Q)/2)

The average similarity measures the independence of a distribution relative to the other distributions:

AS(k) = (1/(K-1)) · ∑_{j≠k} JS(φ_k, φ_j)

where K is the total number of topics;
s34, screening high-quality subjects
Calculating a theme high-quality coefficient according to the theme entropy and the average similarity, wherein the calculation method specifically comprises the following steps:
Figure BDA0001797896030000085
and if the theme high-quality coefficient meets G (k) > mu and mu is a threshold value, judging that the theme belongs to a high-quality theme as an expansion alternative, otherwise, judging that the theme is not the high-quality theme, and further obtaining a high-quality theme set S.
S41, for each microblog, according to the topic distribution given by the topic model trained on the training set, selecting the topic with the maximum probability among the high-quality topics, taking from the corresponding topic words the λ feature words with the highest probabilities, w = {w_1, w_2, …, w_λ}, and adding the words w to the training text's features as expansion words; an expansion word not already present in the original document is merged into the document.
S42, performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability within the high-quality topic set S, taking its λ highest-probability feature words w = {w_1, w_2, …, w_λ}, and adding these words to the test text's features as expansion words.
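A minimal sketch of the expansion in steps S41/S42, assuming the document's topic distribution θ_m, the topic-word distributions φ_k, and the high-quality topic set are already available (the function and argument names are illustrative):

```python
def expand_text(tokens, theta_m, quality_topics, phis, vocab, lam):
    """Pick the highest-probability topic among the high-quality set,
    take its lam most probable words, and append those not already
    present to the text's features (steps S41/S42)."""
    if not quality_topics:
        return list(tokens)
    # topic with the maximum probability within the high-quality set
    best = max(quality_topics, key=lambda k: theta_m[k])
    # rank vocabulary indices by probability under the chosen topic
    ranked = sorted(range(len(vocab)), key=lambda v: phis[best][v], reverse=True)
    expansion = [vocab[v] for v in ranked[:lam]]
    return list(tokens) + [w for w in expansion if w not in tokens]
```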
S51, representing the expanded text obtained in step S41 with a vector space model: document d is treated as an n-dimensional vector d = (ε_1, ε_2, …, ε_n), where ε_i is the weight of the i-th word, calculated with TF-IDF:

ε_i = tf_{ij} · log( M / df_i )

where tf_{ij} is the frequency of the feature word in the text, df_i is the number of texts in the corpus that contain the feature word, and M is the total number of texts in the corpus;
S52, classifying the texts with the LIBSVM tool; each document is converted into the data format label 1:value 2:value …, where label is the category identifier, the integers 1, 2, … are feature indices, and each value is the corresponding TF-IDF weight;
S53, recording the training-set category labels Y = {y1, y2, …, yn}, training the model on the training set, and then performing classification prediction on the test set.
Experimental verification shows that, compared with the single SVM method and with LDA combined with SVM, the microblog text classification method based on high-quality topic expansion markedly improves accuracy, as shown in Table 1:
TABLE 1
Method                         Recall    Accuracy
SVM                            0.754     0.760
LDA + SVM                      0.831     0.822
High-quality topics + SVM      0.863     0.857
A microblog text classification system based on high-quality topic expansion comprises:
a text acquisition unit for acquiring collected microblog text data and constructing a training set and a test set;
a text data preprocessing unit for preprocessing the original text samples and selecting features, comprising:
a Chinese word segmentation module for splitting complete sentences into words and removing stop words from the text,
a Chinese stop-word list module for deleting feature words that appear in the stop-word list and removing punctuation marks,
a dictionary building module for sorting and aggregating the feature words in the text;
an LDA model training unit for obtaining the document-topic and topic-word distributions from the training-set data, comprising:
a data processing module for computing the quality coefficient from the topic-word distribution data and selecting the high-quality topics with a set threshold;
the LDA model training unit is further used to expand the training-set and test-set texts with the high-quality feature words;
and a text classification unit for classifying the expanded training set with the LIBSVM tool and classifying the test-set data to generate the classification result.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A microblog text classification method based on high-quality theme expansion is characterized by comprising the following steps of:
s1, performing data preprocessing on the microblog text, selecting characteristics, and constructing a training set and a test set through the preprocessed text;
s2, taking the preprocessed training set data as the input of an LDA model to obtain the probability distribution of the subjects of the training set data and the probability distribution of the subject words;
s3, applying information entropy to the topic-word probability distribution to calculate the topic entropy, and calculating the relative entropy and the average similarity of the topics, thereby computing a quality coefficient for each topic; a threshold is set to screen out the high-quality topics;
s4, performing topic division on the training set and the test set respectively: through the LDA topic distribution, for each text the topic words of the high-quality topic with the maximum probability are selected and added as expansion words to the text features of the training set and the test set respectively;
s5, performing text representation on the expanded texts with a vector space model, calculating the weight of each feature word with TF-IDF, converting the training and test documents into vectors, selecting useful features, training an SVM classifier on the training set, and performing classification prediction on the test set to generate the classification result;
in step S2, the probability distribution of the topics of the training set data is obtained by:
s21, setting a topic model parameter α and a topic number K, and drawing the document-topic (doc-topic) distribution θm of each microblog from a Dirichlet distribution with parameter α: θm~Dir(α), m∈[1,M], where θm represents the topic probability distribution of document m, estimated as

θm,k=(nm,k+α)/∑k'=1..K(nm,k'+α)

wherein nm,k represents the number of words of the m-th microblog assigned to the k-th topic;
obtaining the probability distribution of the subject term of the training set data by the following steps:
s22, setting a topic model parameter β and a topic number K, and drawing the word distribution (topic-word) φk of each topic from a Dirichlet distribution with parameter β: φk~Dir(β), k∈[1,K], where φk represents the probability distribution of words under topic k, estimated as

φk,v=(nk,v+β)/∑v'=1..V(nk,v'+β)

wherein nk,v represents the number of times the word v appears under the topic k and V is the vocabulary size;
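The smoothed doc-topic and topic-word estimates of steps S21 and S22 can be sketched in plain Python. This is a minimal sketch: the count matrices are assumed to come from Gibbs sampling of the LDA model (not shown), and the function names are illustrative, not from the patent:

```python
def doc_topic_dist(n_mk, alpha):
    """theta_m: smoothed topic distribution of one document.
    n_mk[k] = number of words in the document assigned to topic k."""
    K = len(n_mk)
    denom = sum(n_mk) + K * alpha
    return [(n_mk[k] + alpha) / denom for k in range(K)]

def topic_word_dist(n_kv, beta):
    """phi_k: smoothed word distribution of one topic.
    n_kv[v] = number of times word v was assigned to the topic."""
    V = len(n_kv)
    denom = sum(n_kv) + V * beta
    return [(n_kv[v] + beta) / denom for v in range(V)]
```

The Dirichlet priors α and β act as pseudo-counts, so every topic and every word keeps a small nonzero probability even when its raw count is zero.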
the step S3 specifically includes:
s31, calculating a topic information entropy TE, specifically:
TE(k)=-∑w P(w|k)*lnP(w|k)
wherein P (w | k) represents the probability of the word w appearing under the topic k;
s32, calculating the relative entropy (KL divergence) between topics, specifically:

D(P‖Q)=∑x P(x)*ln(P(x)/Q(x))

wherein P and Q are the two distributions being compared; the relative entropy is zero when the two distributions are identical, and increases as the difference between them increases;
s33, calculating the average similarity of the topics, specifically:

the JS distance between topics, used to measure topic similarity, is calculated from the relative entropy:

JS(P,Q)=(1/2)*D(P‖M)+(1/2)*D(Q‖M), where M=(P+Q)/2

the average similarity measures the independence of one distribution relative to the other distributions, and is calculated as:

AvgSim(k)=(1/(K-1))*∑j≠k JS(φk,φj)

wherein K represents the total number of topics;
s34, screening high-quality subjects
the topic quality coefficient is calculated from the topic entropy and the average similarity, specifically:

G(k)=AvgSim(k)/TE(k)

if the quality coefficient satisfies G(k)>μ, where μ is a threshold, the topic is judged to be a high-quality topic and retained as an expansion candidate; otherwise it is not a high-quality topic; the retained topics form the high-quality topic set S.
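Steps S31 to S34 can be sketched as follows. The exact form of the quality coefficient G(k) is not recoverable from the claim text (its formula image is lost), so the ratio AvgSim(k)/TE(k) used below is an assumption that rewards focused topics (low entropy) that sit far from the others (high average JS distance):

```python
import math

def topic_entropy(phi):
    """TE(k) = -sum_w P(w|k) * ln P(w|k)."""
    return -sum(p * math.log(p) for p in phi if p > 0)

def kl(p, q):
    """Relative entropy D(P||Q); zero when P and Q are identical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS distance built from the relative entropy via M = (P+Q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def quality_topics(phis, mu):
    """Screen high-quality topics: keep topic k when G(k) > mu.
    G(k) = AvgSim(k) / TE(k) is an assumed combination of the two
    quantities named in the claim, not the patent's exact formula."""
    K = len(phis)
    S = []
    for k in range(K):
        avg_sim = sum(js(phis[k], phis[j]) for j in range(K) if j != k) / (K - 1)
        if avg_sim / topic_entropy(phis[k]) > mu:
            S.append(k)
    return S
```

With this reading, a sharp topic far from the rest scores well, while a near-uniform topic (high entropy, close in JS distance to everything) falls below the threshold.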
2. The microblog text classification method based on the high-quality theme extension according to claim 1, wherein the step of performing data preprocessing and feature selection on microblog texts comprises the following steps:
s11, carrying out Chinese word segmentation pretreatment on the text, and dividing the complete sentence into words to obtain a text corpus characteristic set;
s12, removing stop words such as common conjunctions and pronouns from the segmented text: the preprocessing is performed with a Chinese stop-word list, a feature word is deleted if it appears in the stop-word list, and punctuation marks are then removed;
and s13, dividing the preprocessed texts by category to construct a dictionary, counting word statistics within the different categories, sorting the feature words in descending order of total occurrence count, selecting the top n words of each category as its feature words, and summarizing them as the general features of the category.
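The preprocessing of steps S12 and S13 can be sketched as follows, assuming the texts have already been segmented into words (step S11; a Chinese tokenizer such as jieba is typical but not named in the patent). Function names are illustrative:

```python
from collections import Counter

def preprocess(tokens, stopwords):
    """S12: drop stop words, then punctuation, from a segmented text."""
    return [t for t in tokens if t not in stopwords and t.isalnum()]

def category_features(texts_by_category, n):
    """S13: per category, keep the n feature words with the highest
    total occurrence count as the category's general features."""
    features = {}
    for cat, texts in texts_by_category.items():
        counts = Counter(w for text in texts for w in text)
        features[cat] = [w for w, _ in counts.most_common(n)]
    return features
```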
3. The microblog text classification method based on the high-quality theme extension according to claim 1, wherein in the step S4, topic division is specifically performed on the training set as follows:
s41, using the topic distribution obtained from the topic model trained on the training set, selecting for each microblog the topic with the maximum probability among the high-quality topics; the λ feature words with the highest probability under that topic, arranged as w={w1,w2,…wλ}, are added as expansion words to the text features of the training set; if an expansion word does not exist in the original document, it is merged into the document.
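Step S41 (and its test-set counterpart S42) can be sketched as follows. Names are illustrative: theta is the document's topic distribution, quality_set is the high-quality topic set S, and topic_top_words maps each topic to its words sorted by descending probability:

```python
def expand_text(tokens, theta, quality_set, topic_top_words, lam):
    """Among the high-quality topics, pick the one with the maximum
    probability for this text, take its lam most probable words, and
    merge those not already present into the text features."""
    if not quality_set:
        return list(tokens)
    best = max(quality_set, key=lambda k: theta[k])
    expansion = topic_top_words[best][:lam]
    return list(tokens) + [w for w in expansion if w not in tokens]
```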
4. The microblog text classification method based on high-quality topic expansion according to claim 3, wherein in the step S4, the topic inference and feature expansion of the test set specifically comprise:
s42, performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability within the high-quality topic set S, taking the λ feature words w={w1,w2,…wλ} with the highest probability under that topic, and adding them as expansion words to the text features of the test set.
5. The microblog text classification method based on the high-quality subject expansion according to claim 1, wherein the step S5 specifically comprises:
s51, performing text representation on the expanded text obtained in step S41 using the vector space model: document d is regarded as an n-dimensional vector d=(ε1,ε2,…,εn) in the vector space, where εi represents the weight of the i-th word; the weight is calculated with TF-IDF, specifically:

εi=tfij*log(M/dfi)

wherein tfij refers to the frequency of occurrence of the feature word in the text, dfi represents the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus;
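The TF-IDF weighting of step S51 can be sketched as follows. This is a minimal sketch: many TF-IDF variants add normalization or smoothing, and the patent's exact variant is not recoverable from the claim text:

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, df, M):
    """epsilon_i = tf_i * log(M / df_i) for each vocabulary word.
    df[w] = number of corpus texts containing w; M = corpus size."""
    tf = Counter(tokens)
    return [tf[w] * math.log(M / df[w]) if w in df and df[w] > 0 else 0.0
            for w in vocab]
```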
s52, performing text classification with the LIBSVM tool; each document is converted into the format label 1:value 2:value …, wherein label is the category identifier, 1 and 2 are feature indices, and each value is the TF-IDF weight of the corresponding feature;
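The LIBSVM conversion of step S52 can be sketched with a small illustrative helper (sparse format, 1-based feature indices, zero weights omitted):

```python
def to_libsvm_line(label, weights):
    """Serialize one document as 'label idx:weight ...', the sparse
    LIBSVM input format with 1-based feature indices."""
    parts = [str(label)]
    parts += [f"{i}:{w:g}" for i, w in enumerate(weights, start=1) if w != 0.0]
    return " ".join(parts)
```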
and s53, recording the training set category labels Y={y1,y2,…,yn}, training the model on the training set, and then performing classification prediction on the test set.
6. A microblog text classification system implementing the microblog text classification method based on the high-quality subject expansion according to any one of claims 1 to 5, characterized by comprising:
a text acquisition unit for acquiring self-collected microblog text data and constructing a training set and a test set;
the text data preprocessing unit is used for preprocessing an original text sample and selecting features, and comprises the following steps:
a Chinese word segmentation module for dividing the complete sentence into words and eliminating stop words in the text,
a Chinese inactive word list module for deleting the feature words in the inactive word list appearing in the text and eliminating punctuation marks,
the dictionary building module is used for sequencing the characteristic words in the text and summarizing the characteristic words;
the LDA model training unit is used for obtaining document theme distribution and theme word distribution conditions through training set data, and comprises:
the data processing module is used for calculating a high-quality coefficient through the distribution data of the theme words and dividing a high-quality theme through a set threshold;
the LDA model training unit is also used for taking the high-quality characteristic words as text extension of a training set and text extension of a test set;
and the text classification unit is used for performing text classification on the training set after the text expansion through an LIBSVM tool, and classifying the data to be tested of the test set to generate a classification result.
7. A storage medium comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 5.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to perform the method of any one of claims 1 to 5.
CN201811064231.3A 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension Active CN109344252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064231.3A CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811064231.3A CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Publications (2)

Publication Number Publication Date
CN109344252A CN109344252A (en) 2019-02-15
CN109344252B true CN109344252B (en) 2021-12-07

Family

ID=65304880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064231.3A Active CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Country Status (1)

Country Link
CN (1) CN109344252B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium
CN113177409A (en) * 2021-05-06 2021-07-27 上海慧洲信息技术有限公司 Intelligent sensitive word recognition system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425710A (en) * 2012-05-25 2013-12-04 北京百度网讯科技有限公司 Subject-based searching method and device
CN104899273B (en) * 2015-05-27 2017-08-25 东南大学 A kind of Web Personalization method based on topic and relative entropy
CN108121736B (en) * 2016-11-30 2021-06-08 北京搜狗科技发展有限公司 Method and device for establishing subject term determination model and electronic equipment
CN106991127B (en) * 2017-03-06 2020-01-10 西安交通大学 Knowledge subject short text hierarchical classification method based on topological feature expansion
CN108090231A (en) * 2018-01-12 2018-05-29 北京理工大学 A kind of topic model optimization method based on comentropy

Also Published As

Publication number Publication date
CN109344252A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
US11093854B2 (en) Emoji recommendation method and device thereof
CN105183833B (en) Microblog text recommendation method and device based on user model
EP2486470B1 (en) System and method for inputting text into electronic devices
US20210056571A1 (en) Determining of summary of user-generated content and recommendation of user-generated content
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN105022754B (en) Object classification method and device based on social network
CN105183717B (en) A kind of OSN user feeling analysis methods based on random forest and customer relationship
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN111061957A (en) Article similarity recommendation method and device
CN109271639B (en) Hot event discovery method and device
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN109086375A (en) A kind of short text subject extraction method based on term vector enhancing
CN107885717B (en) Keyword extraction method and device
CN113836938A (en) Text similarity calculation method and device, storage medium and electronic device
CN115186665B (en) Semantic-based unsupervised academic keyword extraction method and equipment
CN109344252B (en) Microblog text classification method and system based on high-quality theme extension
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN111914554A (en) Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN115062621A (en) Label extraction method and device, electronic equipment and storage medium
CN107665222B (en) Keyword expansion method and device
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
CN108763258B (en) Document theme parameter extraction method, product recommendation method, device and storage medium
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN111339778A (en) Text processing method, device, storage medium and processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant