CN109344252A - Microblog text classification method and system based on high-quality topic expansion - Google Patents

Microblog text classification method and system based on high-quality topic expansion

Info

Publication number
CN109344252A
CN109344252A (application CN201811064231.3A; granted publication CN109344252B)
Authority
CN
China
Prior art keywords
theme
text
microblogging
word
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811064231.3A
Other languages
Chinese (zh)
Other versions
CN109344252B (en)
Inventor
张曦元
孙福权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811064231.3A priority Critical patent/CN109344252B/en
Publication of CN109344252A publication Critical patent/CN109344252A/en
Application granted granted Critical
Publication of CN109344252B publication Critical patent/CN109344252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a microblog text classification method and system based on high-quality topic expansion. The feature expansion is suited to the classification of short texts such as microblogs and enables effective microblog classification. Training-set microblog data are taken as the input of an LDA model to obtain the topic probability distributions and the word probability distributions; highly representative topics are extracted by information entropy, and high-quality topics are then screened according to inter-topic similarity; topic inference is performed on the test-set microblogs; topic words of the high-quality topics are selected to expand the features of the microblog texts; and the expanded microblog texts are classified with a support vector machine. The method addresses the problem of inaccurate text feature expansion caused by mixed topic words when a topic model is used to expand microblog text features.

Description

Microblog text classification method and system based on high-quality topic expansion
Technical field
The present invention relates to the field of text classification technology, and in particular to a microblog text classification method and system based on high-quality topic expansion.
Background technique
As one of the emerging media, microblogging now has hundreds of millions of users and occupies a leading position among Chinese social networking platforms. Microblogs are easy to post, updated rapidly, and of high research value. Text classification has been studied extensively over the past decades, but the results on short texts such as microblogs have remained unsatisfactory. Microblog texts are short and their features are sparse; after word segmentation and stop-word filtering remove some words, and feature selection removes still more, only a few features remain. Although this reduces computational complexity, it clearly lowers classification accuracy. For better classification, the features of microblog texts therefore need to be expanded.
The LDA model is a three-layer Bayesian probability model consisting of words, topics, and documents. It assumes that every document is composed of multiple latent topics, mines the latent topics from the co-occurrence relations among words, represents each text as a probability distribution over topics, and represents each topic as a probability distribution over a set of words. Using the topic distributions to expand the features of short texts is an effective means of improving short-text classification; however, not every topic trained by a topic model expresses a single subject completely. Mixed and unclear topics exist, and expanding short texts with them directly may introduce unrelated features.
Summary of the invention
In view of the problem set forth above when expanding microblog texts with a topic model, the present invention provides a microblog text classification method and system based on high-quality topic expansion. The method effectively extracts good topics and, after they are used to expand the microblog features, overcomes the poor classification performance caused by feature sparsity.
The technical solution adopted by the present invention is as follows:
A microblog text classification method based on high-quality topic expansion comprises the following steps:
S1. Preprocess the microblog texts and perform feature selection; construct a training set and a test set from the preprocessed texts.
S2. Take the preprocessed training-set data as the input of an LDA model to obtain the topic probability distribution and the topic-word probability distribution of the training-set data.
S3. Apply information entropy to the topic-word distributions to compute the topic entropy, and compute the relative entropy and the average similarity of each topic; from these, compute a topic quality score, and filter out the high-quality topics with a set threshold.
S4. Perform topic assignment on the training set and the test set respectively: for each text, select the topic with the largest probability among the high-quality topics in its LDA topic distribution, and add the topic words of that topic as expansion words to the text features of the training set and the test set.
S5. Represent the expanded texts with a vector space model, compute the weight of each feature word with TF-IDF, convert the training and test documents into vectors, select the useful features, train the SVM classifier on the training set, and then perform class prediction on the test set to generate the classification results.
Further, preprocessing the microblog texts and performing feature selection comprises the following steps:
S11. Perform Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus.
S12. Remove common conjunctions, pronouns, and other stop words from the segmented texts: preprocess with a Chinese stop-word list, delete any feature word that appears in the stop-word list, and then remove the punctuation marks.
S13. Divide the preprocessed texts by class to build a dictionary, count the word statistics of each class, sort the feature words in descending order of total occurrence count, and select the top n words of each class as the feature words of that class; together they serve as the general features of the classes.
Further, in step S2, the topic probability distribution of the training-set data is obtained as follows:
S21. Set the topic-model parameter α and the topic number K, and draw the topic distribution (doc-topic matrix) of each microblog from a Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m denotes the topic probability distribution of document m, estimated as
θ_{m,k} = (n_{m,k} + α) / (Σ_{k=1}^{K} n_{m,k} + K·α)
where n_{m,k} denotes the number of words of the m-th microblog assigned to topic k.
Further, in step S2, the topic-word probability distribution of the training-set data is obtained as follows:
S22. Set the topic-model parameter β and the topic number K, and draw the word distribution (topic-word matrix) of each topic from a Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k denotes the probability distribution over words of topic k, estimated as
φ_{k,v} = (n_{k,v} + β) / (Σ_{v=1}^{V} n_{k,v} + V·β)
where n_{k,v} denotes the number of times word v occurs under topic k and V is the vocabulary size.
Further, step S3 specifically comprises:
S31. Compute the topic information entropy TE:
TE(k) = −Σ_w P(w|k) · ln P(w|k)
where P(w|k) denotes the probability that word w appears under topic k.
S32. Compute the relative entropy between topics:
D_KL(P‖Q) = Σ_x P(x) · ln(P(x) / Q(x))
where P and Q denote the distributions to be compared; when two distributions are identical the relative entropy is zero, and it grows as the difference between the two distributions grows.
S33. Compute the average similarity of each topic.
The JS distance between topics, used to measure inter-topic similarity, is computed from the relative entropy:
JS(P, Q) = ½·D_KL(P‖M) + ½·D_KL(Q‖M), where M = ½(P + Q)
The average similarity measures the independence of one distribution relative to the other distributions; the average similarity of topic k is computed as
Sim(k) = (1 / (K−1)) · Σ_{j≠k} JS(φ_k, φ_j)
where j ≠ k and K denotes the total number of topics.
S34. Screen the high-quality topics.
The topic quality score G(k) is computed from the topic entropy and the average similarity. If the topic quality score satisfies G(k) > μ, where μ is a threshold, the topic is judged to be a high-quality topic and kept as an expansion candidate; otherwise it is not a high-quality topic. The high-quality topic set S is thus obtained.
Further, in step S4, topic assignment on the training set is performed as follows:
S41. Obtain the topic distributions from the topic model trained on the training set; for each microblog, select among the high-quality topics the topic with the largest probability, take the λ feature words w = {w_1, w_2, …, w_λ} with the highest probability under that topic, and add them as expansion words to the text features of the training set; a word w is merged into the document only if it is not already present in the original document.
Further, in step S4, topic inference and feature expansion on the test set are performed as follows:
S42. Use the topic model trained on the training set to perform topic inference on the test set and obtain the document-topic distribution matrix of the test texts; for each test text, select the topic with the largest probability within the high-quality topic set S, and add its λ highest-probability feature words w = {w_1, w_2, …, w_λ} as expansion words to the text features of the test set.
Further, step S5 specifically comprises:
S51. Represent the expanded texts obtained in step S41 with a vector space model: regard each document d as an n-dimensional vector v = (ε_1, ε_2, …, ε_n) in the vector space, where ε_i denotes the weight of the i-th word, computed with TF-IDF as
ε_i = tf_{ij} · log(M / df_i)
where tf_{ij} denotes the frequency with which a feature word occurs in a given text, df_i denotes the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus.
S52. Perform text classification with the LIBSVM tool; each document is converted to the data format "label index1:value1 index2:value2 …", where label is the class identifier, the indices number the feature words, and the values are their TF-IDF weights.
S53. Record the class labels Y = {y1, y2, …, yn} of the training set, train the model on the training set, and then perform class prediction on the test set.
The present invention also provides a microblog text classification system based on high-quality topic expansion, comprising:
a text collection unit, for collecting the self-acquired microblog text data and constructing a training set and a test set;
a text data preprocessing unit, for preprocessing the original text samples and performing feature selection, comprising:
a Chinese word segmentation module, for splitting complete sentences into words and removing the stop words in the texts,
a Chinese stop-word list module, for deleting the feature words that appear in the stop-word list and removing the punctuation marks,
a dictionary construction module, for sorting the feature words in the texts and summarizing them;
an LDA model training unit, for obtaining the document-topic distributions and topic-word distributions from the training-set data, comprising:
a data processing module, for computing the quality score from the topic-word distribution data and dividing out the high-quality topics by a set threshold;
the LDA model training unit being also used to expand the texts of the training set and the test set with the high-quality feature words; and
a text classification unit, for classifying the expanded training set with the LIBSVM tool and classifying the test data of the test set to generate the classification results.
Compared with the prior art, the present invention has the following advantages:
The microblog text classification method based on high-quality topic expansion effectively extracts good topics and, after they are used to expand the microblog features, overcomes the poor classification performance caused by feature sparsity. Compared with the prior art, its accuracy is higher and it is better suited to feature expansion in the classification of short texts such as microblogs, enabling effective microblog classification. It effectively solves the problem of inaccurate text feature expansion caused by mixed topic words when a topic model is used to expand microblog text features.
For the above reasons, the present invention can be widely applied in the field of text classification technology.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is the flowchart of the microblog text classification method based on high-quality topic expansion of the present invention.
Fig. 2 is the LDA probabilistic model of the microblog text classification method based on high-quality topic expansion of the present invention.
Specific embodiment
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely in conjunction with the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope protected by the present invention.
It should be noted that the terms "first", "second", etc. in the specification, claims, and drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in sequences other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to that process, method, product, or device.
As shown in Fig. 1, the present invention provides a microblog text classification method based on high-quality topic expansion, comprising the following steps:
S1. Preprocess the microblog texts and perform feature selection; construct a training set and a test set from the preprocessed texts.
S2. Take the preprocessed training-set data as the input of an LDA model to obtain the topic probability distribution and the topic-word probability distribution of the training-set data.
S3. Apply information entropy to the topic-word distributions to compute the topic entropy, and compute the relative entropy and the average similarity of each topic; from these, compute a topic quality score, and filter out the high-quality topics with a set threshold.
S4. Perform topic assignment on the training set and the test set respectively: for each text, select the topic with the largest probability among the high-quality topics in its LDA topic distribution, and add the topic words of that topic as expansion words to the text features of the training set and the test set.
S5. Represent the expanded texts with a vector space model, compute the weight of each feature word with TF-IDF, convert the training and test documents into vectors, select the useful features, train the SVM classifier on the training set, and then perform class prediction on the test set to generate the classification results.
Preprocessing the microblog texts and performing feature selection comprises the following steps:
S11. Perform Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus.
S12. Remove common conjunctions, pronouns, and other stop words from the segmented texts: preprocess with a Chinese stop-word list, delete any feature word that appears in the stop-word list, and then remove the punctuation marks.
S13. Divide the preprocessed texts by class to build a dictionary, count the word statistics of each class, sort the feature words in descending order of total occurrence count, and select the top n words of each class as the feature words of that class; together they serve as the general features of the classes.
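For illustration only (not part of the patented embodiment), steps S11–S13 can be sketched in Python. The patent does not name a segmenter, so a toy forward-maximum-matching segmenter stands in for a real Chinese word segmenter; the lexicon and stop-word list below are made-up assumptions:

```python
from collections import Counter

STOPWORDS = {"的", "了", "和", "是"}               # toy Chinese stop-word list (assumed)
LEXICON = {"微博", "文本", "分类", "主题", "模型"}   # toy segmentation dictionary (assumed)

def segment(sentence, lexicon=LEXICON, max_len=4):
    """S11: forward-maximum-matching segmentation, a stand-in for a real segmenter."""
    words, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + l]
            if l == 1 or cand in lexicon:   # fall back to single characters
                words.append(cand)
                i += l
                break
    return words

def preprocess(sentence):
    """S12: segment, then drop stop words and punctuation."""
    return [w for w in segment(sentence)
            if w not in STOPWORDS and w.isalnum()]

def class_features(docs_by_class, n=2):
    """S13: per class, keep the n most frequent words as its feature words."""
    return {c: [w for w, _ in
                Counter(w for d in docs for w in preprocess(d)).most_common(n)]
            for c, docs in docs_by_class.items()}
```

In practice a statistical segmenter would replace `segment`, and n would be tuned per corpus.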
As shown in Fig. 2, the topic-model parameters α and β and the topic number K are set, and parameter estimation is carried out by means of Gibbs sampling.
S21. Set the topic-model parameter α and the topic number K, and draw the topic distribution (doc-topic matrix) of each microblog from a Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m denotes the topic probability distribution of document m, estimated as
θ_{m,k} = (n_{m,k} + α) / (Σ_{k=1}^{K} n_{m,k} + K·α)
where n_{m,k} denotes the number of words of the m-th microblog assigned to topic k.
S22. Set the topic-model parameter β and the topic number K, and draw the word distribution (topic-word matrix) of each topic from a Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k denotes the probability distribution over words of topic k, estimated as
φ_{k,v} = (n_{k,v} + β) / (Σ_{v=1}^{V} n_{k,v} + V·β)
where n_{k,v} denotes the number of times word v occurs under topic k and V is the vocabulary size.
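A minimal sketch of collapsed Gibbs sampling for LDA, ending with the θ and φ posterior-mean estimates of S21/S22 (symmetric α and β; the corpus, hyperparameter values, and iteration count are illustrative, not the patent's):

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler for LDA. docs: list of token lists.
    Returns (theta, phi, vocab) with theta[m][k] and phi[k][v] as in S21/S22."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    vid = {w: i for i, w in enumerate(vocab)}
    V, M = len(vocab), len(docs)
    n_mk = [[0] * K for _ in range(M)]   # n_{m,k}: tokens of doc m assigned to topic k
    n_kv = [[0] * V for _ in range(K)]   # n_{k,v}: count of word v under topic k
    n_k = [0] * K                        # total tokens assigned to topic k
    z = []                               # topic assignment of every token
    for m, d in enumerate(docs):         # random initialization
        zm = []
        for w in d:
            k = rng.randrange(K)
            zm.append(k)
            n_mk[m][k] += 1; n_kv[k][vid[w]] += 1; n_k[k] += 1
        z.append(zm)
    for _ in range(iters):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k, v = z[m][i], vid[w]
                n_mk[m][k] -= 1; n_kv[k][v] -= 1; n_k[k] -= 1
                # full conditional p(z_i = k | rest), up to a constant
                p = [(n_mk[m][j] + alpha) * (n_kv[j][v] + beta) / (n_k[j] + V * beta)
                     for j in range(K)]
                r, acc, k = rng.random() * sum(p), 0.0, K - 1
                for j, pj in enumerate(p):
                    acc += pj
                    if r < acc:
                        k = j
                        break
                z[m][i] = k
                n_mk[m][k] += 1; n_kv[k][v] += 1; n_k[k] += 1
    # posterior means theta_{m,k} and phi_{k,v} exactly as in S21/S22
    theta = [[(n_mk[m][k] + alpha) / (len(docs[m]) + K * alpha) for k in range(K)]
             for m in range(M)]
    phi = [[(n_kv[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab
```

Each row of θ and φ sums to one by construction, matching the Dirichlet posterior-mean formulas above.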
S31. Apply information entropy to the topic-word distributions to compute the topic entropy, and thereby divide out the high-quality topics. The topic information entropy TE is computed as
TE(k) = −Σ_w P(w|k) · ln P(w|k)
where P(w|k) denotes the probability that word w appears under topic k. The smaller the value of TE, the more uneven the distribution: viewed from each topic, a small number of feature words occur with high probability while the other words occur with low probability; in that case the topic is strongly representative and the topic noise is small.
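The topic entropy of S31 is a one-liner; as a sanity check, a peaked word distribution (representative topic) should score lower than a uniform one:

```python
import math

def topic_entropy(p_w_given_k):
    """TE(k) = -sum_w P(w|k) * ln P(w|k); lower entropy means the topic's
    probability mass concentrates on a few words (more representative)."""
    return -sum(p * math.log(p) for p in p_w_given_k if p > 0)
```

For a uniform distribution over n words, TE equals ln n, the maximum possible value.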
S32. Compute the relative entropy between topics. The relative entropy (KL divergence) is an index measuring the difference between probability distributions:
D_KL(P‖Q) = Σ_x P(x) · ln(P(x) / Q(x))
where P and Q denote the distributions to be compared; when two distributions are identical the relative entropy is zero, and it grows as the difference between the two distributions grows.
S33. Compute the average similarity of each topic.
The JS distance between topics, used to measure inter-topic similarity, is computed from the relative entropy:
JS(P, Q) = ½·D_KL(P‖M) + ½·D_KL(Q‖M), where M = ½(P + Q)
The average similarity measures the independence of one distribution relative to the other distributions; the average similarity of topic k is computed as
Sim(k) = (1 / (K−1)) · Σ_{j≠k} JS(φ_k, φ_j)
where j ≠ k and K denotes the total number of topics.
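Steps S32–S33 can be sketched directly from the formulas (distributions are plain probability lists over a shared vocabulary):

```python
import math

def kl(p, q):
    """Relative entropy D_KL(P || Q) of S32."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS distance of S33, built from the relative entropy; symmetric."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_similarity(k, phi):
    """Sim(k): mean JS distance of topic k's word distribution to every other topic's."""
    others = [js(phi[k], phi[j]) for j in range(len(phi)) if j != k]
    return sum(others) / len(others)
```

Using M = (P+Q)/2 keeps JS finite even when P and Q have disjoint supports, which plain KL does not.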
S34. Screen the high-quality topics.
The topic quality score G(k) is computed from the topic entropy and the average similarity. If the topic quality score satisfies G(k) > μ, where μ is a threshold, the topic is judged to be a high-quality topic and kept as an expansion candidate; otherwise it is not a high-quality topic. The high-quality topic set S is thus obtained.
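The patent's exact formula for combining topic entropy and average similarity into G(k) is not reproduced in this text. The sketch below therefore ASSUMES an illustrative score, G(k) = Sim(k) / TE(k) — rewarding a focused word distribution (low entropy) and independence from other topics (large average JS distance) — only to show the thresholding of S34:

```python
def quality_score(te, sim):
    """ILLUSTRATIVE ASSUMPTION: G(k) = Sim(k) / TE(k); the patent's actual
    combination of topic entropy and average similarity is not given here."""
    return sim / te

def select_quality_topics(te_list, sim_list, mu):
    """S34: keep topic k in the high-quality set S when G(k) > mu."""
    return {k for k, (te, sim) in enumerate(zip(te_list, sim_list))
            if quality_score(te, sim) > mu}
```

Any monotone combination with the same directionality would slot into `quality_score` unchanged.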
S41. Obtain the topic distributions from the topic model trained on the training set; for each microblog, select among the high-quality topics the topic with the largest probability, take the λ feature words w = {w_1, w_2, …, w_λ} with the highest probability under that topic, and add them as expansion words to the text features of the training set; a word w is merged into the document only if it is not already present in the original document.
S42. Use the topic model trained on the training set to perform topic inference on the test set and obtain the document-topic distribution matrix of the test texts; for each test text, select the topic with the largest probability within the high-quality topic set S, and add its λ highest-probability feature words w = {w_1, w_2, …, w_λ} as expansion words to the text features of the test set.
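The expansion step of S41/S42 is the same for training and test texts once θ and φ are available; a sketch (function and argument names are illustrative):

```python
def expand_text(tokens, theta_m, phi, vocab, quality_set, lam=3):
    """S41/S42: pick the document's highest-probability high-quality topic and
    append that topic's top-lambda words that are not already in the text."""
    if not quality_set:
        return list(tokens)
    best = max(quality_set, key=lambda k: theta_m[k])           # argmax over S
    ranked = sorted(range(len(vocab)), key=lambda v: phi[best][v], reverse=True)
    expansion = [vocab[v] for v in ranked[:lam]]
    return list(tokens) + [w for w in expansion if w not in tokens]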
S51. Represent the expanded texts obtained in step S41 with a vector space model: regard each document d as an n-dimensional vector v = (ε_1, ε_2, …, ε_n) in the vector space, where ε_i denotes the weight of the i-th word, computed with TF-IDF as
ε_i = tf_{ij} · log(M / df_i)
where tf_{ij} denotes the frequency with which a feature word occurs in a given text, df_i denotes the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus.
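The TF-IDF weighting of S51 over a fixed vocabulary can be sketched as follows (raw term counts for tf, natural log for idf — a common convention, assumed here since the patent does not fix one):

```python
import math

def tfidf_vector(doc_tokens, corpus, vocab):
    """S51: epsilon_i = tf_ij * log(M / df_i) over a fixed vocabulary."""
    M = len(corpus)
    df = {w: sum(1 for d in corpus if w in d) for w in vocab}
    return [doc_tokens.count(w) * math.log(M / df[w]) if df[w] else 0.0
            for w in vocab]
```

A word occurring in every document gets idf = log(1) = 0, so it contributes nothing to the vector.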
S52. Perform text classification with the LIBSVM tool; each document is converted to the data format "label index1:value1 index2:value2 …", where label is the class identifier, the indices number the feature words, and the values are their TF-IDF weights.
S53. Record the class labels Y = {y1, y2, …, yn} of the training set, train the model on the training set, and then perform class prediction on the test set.
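Writing one document in the sparse LIBSVM input format of S52 can be sketched as (1-based indices and omitted zero weights follow the LIBSVM convention):

```python
def to_libsvm_line(label, weights):
    """S52: 'label index:value ...' with 1-based feature indices;
    zero weights are omitted, as the sparse LIBSVM format allows."""
    feats = " ".join(f"{i + 1}:{w:g}" for i, w in enumerate(weights) if w != 0)
    return f"{label} {feats}".rstrip()
```

One such line per document, written to a text file, is what `svm-train` and `svm-predict` consume.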
The microblog text classification method based on high-quality topic expansion of the present invention was compared with the standalone SVM method and the LDA+SVM method; experiments verified that the accuracy is significantly improved, as shown in Table 1:
Table 1
Method                       Recall    Precision
SVM                          0.754     0.760
LDA+SVM                      0.831     0.822
High-quality topics + SVM    0.863     0.857
A microblog text classification system based on high-quality topic expansion comprises:
a text collection unit, for collecting the self-acquired microblog text data and constructing a training set and a test set;
a text data preprocessing unit, for preprocessing the original text samples and performing feature selection, comprising:
a Chinese word segmentation module, for splitting complete sentences into words and removing the stop words in the texts,
a Chinese stop-word list module, for deleting the feature words that appear in the stop-word list and removing the punctuation marks,
a dictionary construction module, for sorting the feature words in the texts and summarizing them;
an LDA model training unit, for obtaining the document-topic distributions and topic-word distributions from the training-set data, comprising:
a data processing module, for computing the quality score from the topic-word distribution data and dividing out the high-quality topics by a set threshold;
the LDA model training unit being also used to expand the texts of the training set and the test set with the high-quality feature words; and
a text classification unit, for classifying the expanded training set with the LIBSVM tool and classifying the test data of the test set to generate the classification results.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content can be realized in other ways. The device embodiments described above are merely exemplary; for example, the division of the units may be a division by logical function, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Moreover, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disk.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that it is still possible to modify the technical solutions described in the foregoing embodiments, or to replace some or all of the technical features equivalently; and these modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A microblog text classification method based on high-quality topic expansion, characterized by comprising the following steps:
S1. Preprocess the microblog texts and perform feature selection; construct a training set and a test set from the preprocessed texts;
S2. Take the preprocessed training-set data as the input of an LDA model to obtain the topic probability distribution and the topic-word probability distribution of the training-set data;
S3. Apply information entropy to the topic-word distributions to compute the topic entropy, and compute the relative entropy and the average similarity of each topic; from these, compute a topic quality score, and filter out the high-quality topics with a set threshold;
S4. Perform topic assignment on the training set and the test set respectively: for each text, select the topic with the largest probability among the high-quality topics in its LDA topic distribution, and add the topic words of that topic as expansion words to the text features of the training set and the test set;
S5. Represent the expanded texts with a vector space model, compute the weight of each feature word with TF-IDF, convert the training and test documents into vectors, select the useful features, train the SVM classifier on the training set, and then perform class prediction on the test set to generate the classification results.
2. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that performing data preprocessing and feature selection on the microblog texts comprises the following steps:
S11. performing Chinese word segmentation preprocessing on the texts, dividing complete sentences into words to obtain the feature set of the text corpus;
S12. removing common stop words such as conjunctions and pronouns from the segmented texts using a Chinese stop-word list: if a feature word appears in the stop-word list, it is deleted; punctuation marks are also removed;
S13. dividing the preprocessed texts by class to build a dictionary: counting the word statistics of each class, sorting the feature words in descending order of total number of occurrences, and, after aggregation, selecting the top n words of each class as the representative features of that class.
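The preprocessing of steps S12 and S13 can be sketched with the standard library alone. The segmentation of S11 is assumed to have already produced token lists (e.g. via a segmentation tool such as jieba), and the stop-word list below is a tiny hypothetical stand-in for a full Chinese stoplist:

```python
from collections import Counter

# Hypothetical stop-word list standing in for a full Chinese stoplist (S12).
STOPWORDS = {"的", "了", "和", "是", ",", "。"}

def remove_stopwords(tokens):
    """S12: drop stop words and punctuation from a segmented text."""
    return [t for t in tokens if t not in STOPWORDS]

def build_class_dictionary(docs_by_class, n):
    """S13: per class, count word occurrences and keep the top-n words
    as that class's representative features."""
    dictionary = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in remove_stopwords(doc))
        dictionary[label] = [w for w, _ in counts.most_common(n)]
    return dictionary

# Toy two-class corpus of already-segmented microblogs (illustrative data).
docs = {
    "sports": [["比赛", "的", "胜利"], ["比赛", "精彩", "。"]],
    "tech":   [["手机", "发布", "了"], ["手机", "芯片", "是"]],
}
features = build_class_dictionary(docs, n=1)
```

With n = 1 each class keeps its single most frequent content word, which is the per-class "general characteristic" the claim describes.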
3. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that in step S2 the topic probability distribution of the training-set data is obtained as follows:
S21. setting the topic model parameter α and the topic number K, and drawing the topic distribution (doc-topic matrix) of each microblog from the Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m denotes the topic probability distribution of document m,
θ_{m,k} = (n_{m,k} + α) / Σ_{k'=1..K} (n_{m,k'} + α)
where n_{m,k} denotes the number of words assigned to topic k in the m-th microblog;
the probability distribution of the topic words of the training-set data is obtained as follows:
S22. setting the topic model parameter β and the topic number K, and drawing the word distribution (topic-word matrix) of each topic from the Dirichlet distribution with parameter β: φ_k ~ Dir(β), k ∈ [1, K], where φ_k denotes the probability distribution of the words of topic k,
φ_{k,v} = (n_{k,v} + β) / Σ_{v'=1..V} (n_{k,v'} + β)
where n_{k,v} denotes the number of times word v occurs under topic k.
4. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that step S3 specifically comprises:
S31. computing the topic information entropy TE:
TE(k) = -Σ_w P(w|k) · ln P(w|k)
where P(w|k) denotes the probability that word w appears under topic k;
S32. computing the relative entropy of the topics:
KL(P‖Q) = Σ_x P(x) · ln(P(x) / Q(x))
where P and Q denote the distributions to be compared; the relative entropy is zero when the two random distributions are identical, and increases as the difference between the two random distributions grows;
S33. computing the average similarity of the topics: the JS distance between topics, used to measure inter-topic similarity, is computed from the relative entropy as
JS(P, Q) = (1/2) KL(P‖M) + (1/2) KL(Q‖M), with M = (P + Q) / 2;
the average similarity measures the independence of one distribution relative to the others; the average similarity of topic k is computed as
AS(k) = (1 / (K − 1)) · Σ_{j≠k} JS(φ_k, φ_j)
where j ≠ k and K denotes the total number of topics;
S34. screening the high-quality topics: the topic quality measure G(k) is computed from the topic entropy and the average similarity; if the quality measure satisfies G(k) > μ, where μ is the threshold, the topic is judged to be a high-quality topic and serves as a candidate for expansion; otherwise it is not a high-quality topic; the high-quality topic set S is thereby obtained.
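The entropy and similarity quantities of steps S31–S33 can be sketched directly in numpy. The exact combination G(k) of topic entropy and average similarity is not reproduced in this text, so the sketch computes only the two ingredients; the φ matrix is toy data:

```python
import numpy as np

def topic_entropy(p):
    """S31: TE(k) = -sum_w P(w|k) * ln P(w|k); lower means a more focused topic."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    """S32: relative entropy KL(P||Q); zero iff the two distributions coincide."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """S33: symmetric JS distance built from KL, measuring topic-topic similarity."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_similarity(phi, k):
    """S33: mean JS distance of topic k to every other topic j != k."""
    K = phi.shape[0]
    return sum(js(phi[k], phi[j]) for j in range(K) if j != k) / (K - 1)

# Toy topic-word matrix: K=3 topics over V=4 words; topic 2 nearly copies topic 1.
phi = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.6, 0.2, 0.1]])
entropies = [topic_entropy(p) for p in phi]
```

As expected, KL of a distribution against itself is zero, and the two near-identical topics have a much smaller JS distance than two distinct ones.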
5. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that in step S4 the topic partitioning of the training set is specifically:
S41. obtaining the topic distribution from the topic model trained on the training set; for each microblog, selecting the topic with the maximum probability value among the high-quality topics, and adding the λ top-ranked feature words w = {w_1, w_2, …, w_λ} of the topic-word distribution of that topic as expansion words to the text features of the training set, an expansion word w being merged into a document only if it is not already present in the original document.
6. The microblog text classification method based on high-quality topic expansion according to claim 5, characterized in that in step S4 the topic inference and feature expansion for the test set are specifically:
S42. performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability value within the high-quality topic set S, and adding the λ top-ranked feature words w = {w_1, w_2, …, w_λ} of that topic as expansion words to the text features of the test set.
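Steps S41/S42 reduce to: pick a document's highest-probability high-quality topic, then append that topic's top-λ words that the document does not already contain. A minimal sketch with hypothetical vocabulary and distributions:

```python
import numpy as np

def expand_document(tokens, theta_m, phi, vocab, good_topics, lam):
    """S41/S42: select the highest-probability topic among the high-quality
    set, take its top-lambda words, and append the ones the text lacks."""
    k = max(good_topics, key=lambda t: theta_m[t])   # best high-quality topic
    top = np.argsort(phi[k])[::-1][:lam]             # top-lambda word indices
    expansion = [vocab[v] for v in top if vocab[v] not in tokens]
    return tokens + expansion

# Illustrative vocabulary, topic-word matrix, and doc-topic distribution.
vocab = ["比赛", "胜利", "手机", "芯片"]
phi = np.array([[0.50, 0.40, 0.05, 0.05],   # topic 0: sports-like
                [0.05, 0.05, 0.50, 0.40]])  # topic 1: tech-like
theta_m = np.array([0.2, 0.8])              # this microblog leans to topic 1
expanded = expand_document(["手机"], theta_m, phi, vocab,
                           good_topics=[0, 1], lam=2)
```

Here the text already contains "手机", so only "芯片" is actually merged in, matching the claim's rule that an expansion word is added only when absent from the original document.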
7. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that step S5 is specifically:
S51. representing the expanded texts obtained in step S41 with a vector space model, treating each document d as an n-dimensional vector in the vector space, and computing the feature weights with TF-IDF: v = (ε_1, ε_2, …, ε_n), where ε_i denotes the weight of the i-th word, computed as
ε_i = tf_{ij} · log(M / df_i)
where tf_{ij} denotes the frequency with which a feature word occurs in a given text, df_i denotes the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus;
S52. performing text classification with the LIBSVM tool, converting each document into the data format label 1:value_1 2:value_2 …, where label is the class identifier and value_i is the TF-IDF weight of the i-th feature;
S53. recording the class labels Y = {y_1, y_2, …, y_n} of the training set, training the model on the training set, and then performing classification prediction on the test set.
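A minimal sketch of the TF-IDF weighting of S51 (using the common tf·log(M/df) variant, since the patent publishes its formula only as an image) and the sparse LIBSVM input line of S52:

```python
import math

def tf_idf(tf_ij, df_i, M):
    """S51: weight of feature i in document j; a word occurring in every
    document (df_i == M) gets weight zero."""
    return tf_ij * math.log(M / df_i)

def to_libsvm_line(label, weights):
    """S52: 'label 1:value1 2:value2 ...' line expected by LIBSVM;
    feature indices start at 1 and zero weights are omitted."""
    feats = " ".join(f"{i + 1}:{w:.4f}" for i, w in enumerate(weights) if w != 0)
    return f"{label} {feats}"

M = 4                                          # corpus size (toy value)
w = [tf_idf(2, 1, M), 0.0, tf_idf(1, 4, M)]    # third feature: df = M -> idf 0
line = to_libsvm_line(1, w)
```

Omitting zero-valued features keeps the LIBSVM file sparse, which is how the tool's format is normally used.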
8. A microblog text classification system based on high-quality topic expansion, characterized by comprising:
a text collection unit, configured to collect microblog text data and construct the training set and the test set from it;
a text data preprocessing unit, configured to preprocess the raw text samples and perform feature selection, comprising:
a Chinese word segmentation module, configured to divide complete sentences into words and remove the stop words from the texts,
a Chinese stop-word list module, configured to delete the feature words appearing in the stop-word list from the texts and remove punctuation marks,
a dictionary building module, configured to sort the feature words in the texts and aggregate them;
an LDA model training unit, configured to obtain the document-topic distribution and the topic-word distribution from the training-set data, comprising:
a data processing module, configured to compute the quality measure from the topic-word distribution data and to partition out the high-quality topics with a set threshold;
the LDA model training unit being further configured to use the high-quality feature words as expansion words for the text expansion of the training-set and test-set texts;
a text classification unit, configured to perform text classification on the expanded training set with the LIBSVM tool, classify the test data of the test set, and generate the classification results.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the method according to any one of claims 1 to 7 is executed.
10. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor executes the method according to any one of claims 1 to 7 by running the computer program.
CN201811064231.3A 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension Active CN109344252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064231.3A CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Publications (2)

Publication Number Publication Date
CN109344252A true CN109344252A (en) 2019-02-15
CN109344252B CN109344252B (en) 2021-12-07

Family

ID=65304880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064231.3A Active CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Country Status (1)

Country Link
CN (1) CN109344252B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425710A (en) * 2012-05-25 2013-12-04 北京百度网讯科技有限公司 Subject-based searching method and device
CN104899273A (en) * 2015-05-27 2015-09-09 东南大学 Personalized webpage recommendation method based on topic and relative entropy
CN106991127A * 2017-03-06 2017-07-28 西安交通大学 Knowledge-topic short text hierarchical classification method based on topological feature expansion
CN108090231A * 2018-01-12 2018-05-29 北京理工大学 Topic model optimization method based on information entropy
CN108121736A * 2016-11-30 2018-06-05 北京搜狗科技发展有限公司 Method and device for building a topic word determination model, and electronic device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569270A (en) * 2019-08-15 2019-12-13 中国人民解放军国防科技大学 bayesian-based LDA topic label calibration method, system and medium
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium
CN113177409A (en) * 2021-05-06 2021-07-27 上海慧洲信息技术有限公司 Intelligent sensitive word recognition system
CN113177409B (en) * 2021-05-06 2024-05-31 上海慧洲信息技术有限公司 Intelligent sensitive word recognition system

Also Published As

Publication number Publication date
CN109344252B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
US11853704B2 (en) Classification model training method, classification method, device, and medium
CN108363804B (en) Local model weighted fusion Top-N movie recommendation method based on user clustering
CN106951422B (en) Webpage training method and device, and search intention identification method and device
Kadhim et al. Text document preprocessing and dimension reduction techniques for text document clustering
Froud et al. Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering
CN108665323B (en) Integration method for financial product recommendation system
CN103207913B Method and system for acquiring fine-grained commercial semantic relations
EP3279804A1 (en) Data analysis system, data analysis method, data analysis program, and recording medium
CN107025284A (en) The recognition methods of network comment text emotion tendency and convolutional neural networks model
CN107301171A Text sentiment analysis method and system based on sentiment dictionary learning
Bayot et al. Multilingual author profiling using word embedding averages and svms
CN110442841A (en) Identify method and device, the computer equipment, storage medium of resume
CN109086375A Short text topic extraction method based on word vector enhancement
CN110110225B (en) Online education recommendation model based on user behavior data analysis and construction method
CN106506327B (en) Junk mail identification method and device
CN106599054A (en) Method and system for title classification and push
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN101714135B (en) Emotional orientation analytical method of cross-domain texts
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN107609113A Automatic document classification method
CN110705247A Text similarity calculation method based on χ²-C
CN108875034A Chinese text classification method based on hierarchical long short-term memory network
CN109657064A Text classification method and device
CN107463715A (en) English social media account number classification method based on information gain
CN111221968A (en) Author disambiguation method and device based on subject tree clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant