CN109726394A - Short-text topic clustering method based on a fused BTM model - Google Patents

Short-text topic clustering method based on a fused BTM model

Info

Publication number
CN109726394A
CN109726394A
Authority
CN
China
Prior art keywords
text
distance
model
btm
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811546170.4A
Other languages
Chinese (zh)
Inventor
贾海涛
李泽华
刘小清
任利
贾宇明
赫熙煦
周焕来
罗心
王启杰
李清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201811546170.4A
Publication of CN109726394A
Legal status: Pending

Links

Abstract

The invention discloses a short-text topic clustering method based on a fused BTM model, belonging to the technical field of data clustering. The method first performs text preprocessing on the short texts to be clustered to obtain a data set D, and then extracts text vectors based on the BTM model and the VSM model respectively. When k-means clustering is applied to the data set D, the number of clusters is obtained by the cluster-number estimation scheme proposed by the invention, and k-clustering is performed with the following clustering criterion: the weighted sum of the distances between any two texts, computed separately from the two kinds of text vectors. The invention combines the BTM model and the VSM model to cluster short texts by topic, thereby improving the clustering effect; at the same time it measures the clustering effect by the intra-class and inter-class distances, automatically adjusts the number of clusters, and compensates for the accuracy drop caused by the BTM model's need to pre-specify the number of topics.

Description

Short-text topic clustering method based on a fused BTM model
Technical field
The invention belongs to the technical field of data clustering, and in particular relates to a short-text topic clustering method based on a fused BTM model.
Background art
At present, the main models concerning short-text topics are the BTM (Biterm Topic Model), the VSM (Vector Space Model) and the LDA (Latent Dirichlet Allocation) model.
Among them, the BTM is a text topic model, but it differs markedly from traditional topic models such as PLSA (Probabilistic Latent Semantic Analysis) or LDA. Traditional topic models are generally only suitable for long texts, because the sparse and missing features of short texts seriously affect model building. Many researchers have tried to extend and optimize such models to enhance their applicability to short texts, for example by introducing external knowledge to expand the short texts, or by concatenating short texts and treating them as pseudo-long texts. Although these approaches improve the models, they cannot overcome the inherent drawbacks of conventional models, whereas the modeling process of the BTM avoids the above drawbacks and achieves better results.
The VSM, i.e. the vector space model, follows a fairly simple principle: a text is represented as a vector in a space, so that vector operations can subsequently be applied to texts. After a text is mapped into the vector space, the similarity between texts can be measured by the distance between their vectors, which is easy to understand.
At present, classical clustering algorithms include k-means and k-medoids, but such algorithms require the number of clusters to be specified in advance and cannot assess the optimal cluster number beforehand.
Summary of the invention
The object of the invention is, in view of the above problems, to realize topic clustering of short texts by combining the BTM model and the VSM model, thereby improving the clustering effect; and at the same time to measure the clustering effect by the intra-class and inter-class distances and automatically adjust the number of clusters, compensating for the accuracy drop caused by the BTM model's need to pre-specify the number of topics.
The short-text topic clustering method based on a fused BTM model of the invention includes the following steps:
Step S1: perform text preprocessing on the short texts to be clustered to obtain a data set D;
Step S2: based on the data set D and a preset number of topics K, build the BTM model and generate the document-topic distribution matrix θ and the topic-word distribution matrix φ;
and, based on the document-topic distribution matrix θ, represent the text vector of any text i in the data set D, denoted d_i_btm;
Step S3: based on the data set D, represent any text i in the data set D as a text vector with the TF-IDF strategy, denoted d_i_vsm;
Step S4: initialize the marker bit k̂ = k_min and set k = k_min, where k_min denotes the lower limit of the cluster number of the k-means algorithm;
Step S5: if k > k_max, execute step S8; otherwise execute step S6; where k_max denotes the upper limit of the cluster number of the k-means algorithm;
Step S6: randomly select k text vectors from the data set D as initial cluster centers, perform k-clustering on the data set D with the k-means algorithm, and compute the clustering quality J(k) of the corresponding k value from the obtained clustering result;
Let J(k̂) denote the clustering quality corresponding to the marker bit k̂; if J(k) < J(k̂), update k̂ = k and then execute step S7; otherwise execute step S7 directly;
Here, the clustering criterion used during clustering is the weighted sum of a first distance and a second distance, where the first distance is the JS (Jensen-Shannon) distance based on the text vectors d_i_btm, and the second distance is the cosine distance based on the text vectors d_i_vsm;
The clustering quality J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) denotes the inter-class distance;
The intra-class distance is the minimum, over all texts, of the average distance of a text to the other texts;
The inter-class distance is the distance between the two nearest texts belonging to different clusters;
Step S7: update k = k + 1 and continue with step S5;
Step S8: perform k-clustering with the k-means algorithm, where k = k̂.
In conclusion by adopting the above-described technical solution, the beneficial effects of the present invention are: special in conjunction with both BTM and VSM Text feature is put and optimized, improves Clustering Effect, while interior, between class distance measures Clustering Effect based on class, it is automatic to adjust The quantity that clusters is saved, the problem of BTM model need to shift to an earlier date accuracy decline caused by preassignment theme quantity is compensated.
Brief description of the drawings
Fig. 1 is a schematic diagram of the BTM voca.txt input format;
Fig. 2 is a schematic diagram of the BTM doc_wids.txt input format;
Fig. 3 is the clustering flow chart of the fused BTM model;
Fig. 4 is a schematic diagram of the word-segmentation numbering result;
Fig. 5 is a schematic diagram of the BTM biterm (word-pair) numbering result;
Fig. 6 is a schematic diagram of the document space-vector matrix;
Fig. 7 is a schematic diagram of the BTM document-topic distribution matrix;
Fig. 8 is a comparison chart of the F values of BTM and LDA under different topic numbers K;
Fig. 9 shows the curves of the fused model for different values of λ;
Fig. 10 is a schematic diagram of the intra-class to inter-class distance ratio under different cluster numbers;
Fig. 11 is a comparison chart of the clustering accuracy of each model;
Fig. 12 is a comparison chart of the F value of each model.
Specific embodiment
To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.
The short-text topic model BTM (Biterm Topic Model) mainly studies the document-topic distribution probability and de-emphasizes the word-frequency features within a single document, whereas the vector space model (Vector Space Model) emphasizes word-frequency features. To overcome the shortcomings of using either model alone, the invention combines the features of both and optimizes the text features, improving the clustering effect; at the same time, the clustering effect is measured by the intra-class and inter-class distances and the number of clusters is adjusted automatically, compensating for the accuracy drop caused by the BTM model's need to pre-specify the number of topics.
The inner principle of the VSM is fairly simple: a text is represented as a vector in a space, so that vector operations can subsequently be applied to texts. After a text is mapped into the vector space, the similarity between texts can be measured by the distance between their vectors, which is easy to understand.
Suppose there are m documents containing n distinct words in total; each of the m documents can then be represented on the basis of the n words w_i (i = 1, ..., n). The vector space model of a document can thus be expressed as d_i = {w_1, w_2, w_3, ..., w_n}, and the text vector matrix D of the m documents can be expressed as the m×n matrix D = (d_ij), i = 1, ..., m, j = 1, ..., n.
Each row of the matrix represents a text and each column represents a distinct word; the element d_ij of the matrix indicates the proportion of the j-th word in the i-th document.
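For illustration only (not part of the patent text), a minimal Python sketch of such a document-word matrix, with d_ij taken as the proportion of word j among the tokens of document i; all names are illustrative:

```python
from collections import Counter

def build_vsm_matrix(tokenized_docs):
    """Build the m x n document-word matrix D; entry D[i][j] is the
    proportion of vocabulary word j among the tokens of document i."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [0.0] * len(vocab)
        for w, c in counts.items():
            row[index[w]] = c / len(doc)
        matrix.append(row)
    return matrix, vocab

# Example with two tiny pre-segmented "documents".
D, vocab = build_vsm_matrix([["haze", "mask", "haze"], ["phone", "price"]])
```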
BTM is a statistics-based model; it avoids the susceptibility of traditional short-text models to feature sparsity by exploiting the implicit semantics between words and mapping texts into a topic space. Based on this model, a more reasonable document-topic and topic-word distribution can be obtained. In this embodiment, the document features are represented by the document-topic distribution matrix and labeled d_btm. Meanwhile, the short texts are vectorized with the TF-IDF weighting strategy and labeled d_vsm. Finally, a weighted fusion coefficient λ is introduced during document similarity computation to perform the fusion.
The vector expressions d_vsm and d_btm of the i-th text are respectively:
d_i_vsm = {w_1, w_2, w_3, ..., w_n};
d_i_btm = {p(z_1|d), p(z_2|d), p(z_3|d), ..., p(z_K|d)};
where p(z_j|d) denotes the distribution probability of the j-th topic z_j in text d, j = 1, 2, ..., K.
TF-IDF is the most classical and most commonly used method for weighting feature words in a vector space. In this embodiment, the distinct words of the short texts are taken as feature items, and the corresponding weights are assigned through TF and IDF.
TF (Term Frequency) denotes the number of times a feature word t_k appears in a specific text d_i. A higher count indicates that the word carries a larger proportion of the text; the count is labeled TF_(i,k). For short texts, however, the occurrence counts of the contained feature words are generally similar, so the overall characteristics of a text are hard to judge from TF alone.
IDF (Inverse Document Frequency) refers to how often a feature word appears in the texts of the text set other than the current text. If this number is large, the word cannot distinguish texts well; if it is small, the word can be regarded as a feature of certain texts. The IDF is calculated as IDF_(i,k) = log(N / (n + α)), where i is the document (text) index, k is the word index, N denotes the number of all text items in the document library, n denotes the number of text items containing the word t_k, and α is an empirical value, generally taken as α = 0.01.
In summary, the weight of a feature word t_k in any text d_i is w_(i,k) = TF_(i,k) × IDF_(i,k); that is, the weight w_(i,k) characterizes each word in d_i_vsm = {w_1, w_2, w_3, ..., w_n}, giving the text vector of any text i in the data set D based on the TF-IDF strategy.
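A sketch of this weighting, assuming the smoothed IDF form given above (the patent does not spell out the exact smoothing used in its experiments):

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs, alpha=0.01):
    """w_(i,k) = TF_(i,k) * IDF_(i,k) with IDF = log(N / (n + alpha)),
    where N is the corpus size and n the number of texts containing the word."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    N = len(tokenized_docs)
    df = Counter(w for doc in tokenized_docs for w in set(doc))  # n per word
    idf = {w: math.log(N / (df[w] + alpha)) for w in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term frequencies TF_(i,k)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vectors, vocab
```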
A text is represented by its topic-distribution vector d_btm, so in the subsequent comparison the invention computes similarity based on the topic features of the texts. Since the topic vector is obtained with a statistical model, it is described in terms of probabilities.
The symmetric version of the KL distance (Kullback-Leibler divergence), i.e. the JS distance, is used as the measure. The first distance (first similarity) between any texts d_i and d_j can thus be expressed as:
D_btm(d_i, d_j) = (1/2) KL(P_i || M) + (1/2) KL(P_j || M), where M = (P_i + P_j)/2,
P_i and P_j denote the topic distributions d_i_btm and d_j_btm, and the function KL(P || Q) = Σ_x P(x) log(P(x)/Q(x)).
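A sketch of this first distance between two topic-distribution vectors; the small eps guarding against zero probabilities is an assumption, since the patent does not discuss smoothing:

```python
import math

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q, eps=1e-12):
    """Symmetric JS distance: 0.5*KL(P||M) + 0.5*KL(Q||M), M = (P+Q)/2."""
    p = [pi + eps for pi in p]
    q = [qi + eps for qi in q]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```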
For the full-text vector d_vsm represented by feature-word weights, the cosine distance of the two vectors is computed directly as the measure; that is, the second distance (second similarity) between any two texts d_i and d_j can be expressed as: D_vsm(d_i, d_j) = 1 − (d_i_vsm · d_j_vsm)/(‖d_i_vsm‖ ‖d_j_vsm‖).
After the above two similarities are computed, a weighting coefficient λ is introduced to perform weighted fusion, forming the weighted text similarity formula based on the JS distance: D(d_i, d_j) = λ D_btm(d_i, d_j) + (1 − λ) D_vsm(d_i, d_j).
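Combining the two measures, a sketch of the fused distance, reusing js_distance from the sketch above; taking the cosine distance as 1 minus the cosine similarity, so that both terms are distances, is an interpretation, and λ = 0.8 follows the experiments below:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two TF-IDF vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0 or nv == 0 else 1.0 - dot / (nu * nv)

def fused_distance(btm_i, btm_j, vsm_i, vsm_j, lam=0.8):
    """D(d_i, d_j) = lambda * D_btm(d_i, d_j) + (1 - lambda) * D_vsm(d_i, d_j)."""
    return lam * js_distance(btm_i, btm_j) + (1 - lam) * cosine_distance(vsm_i, vsm_j)
```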
Referring to Fig. 3, in this embodiment the short-text topic clustering based on the above fused BTM model includes the following steps:
(1) Text preprocessing is performed on the short texts to be clustered to obtain the data set D, where the preprocessing includes word segmentation and stop-word removal;
The short texts to be clustered are typically short-text data such as microblogs, Tieba posts, forum comments, or user sessions with communication behavior obtained after log clustering.
(2) BTM preprocessing, i.e. regularization, which generates two documents: one is a dictionary document, denoted voca.txt, whose purpose is to number each word; the other is denoted doc_wids.txt and is used to replace the words of the entire text set by their numbers. The specific formats of the two files are shown in Figures 1 and 2.
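A sketch of this numbering step under one plausible file layout (the exact layouts shown in Figs. 1 and 2 are not reproduced here; the tab-separated dictionary format is an assumption):

```python
def write_btm_inputs(tokenized_docs, voca_path="voca.txt", doc_path="doc_wids.txt"):
    """Assign an id to each distinct word (voca.txt) and rewrite every
    document as space-separated word ids (doc_wids.txt)."""
    vocab = {}
    for doc in tokenized_docs:
        for w in doc:
            vocab.setdefault(w, len(vocab))
    with open(voca_path, "w", encoding="utf-8") as f:
        for w, wid in sorted(vocab.items(), key=lambda kv: kv[1]):
            f.write(f"{wid}\t{w}\n")
    with open(doc_path, "w", encoding="utf-8") as f:
        for doc in tokenized_docs:
            f.write(" ".join(str(vocab[w]) for w in doc) + "\n")
    return vocab
```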
(3) Modeling is performed with the BTM to generate the document-topic distribution matrix θ and the topic-word distribution matrix φ;
The inputs of the model are the two documents voca.txt and doc_wids.txt, the number of topics K, and the two hyperparameters α and β of the Dirichlet prior, where α = 50/K and β = 0.01. The outputs are the document-topic distribution matrix θ and the topic-word distribution matrix φ. Since the modeling process of the BTM is prior art, it is not described again here.
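Since BTM training itself is treated as prior art, the sketch below only shows the hyperparameter setting described above and how a per-text topic vector d_i_btm would be read off a document-topic matrix θ; the random θ is a stand-in for the output of an actual BTM Gibbs-sampling run:

```python
import numpy as np

K = 6                        # preset number of topics
alpha, beta = 50 / K, 0.01   # Dirichlet hyperparameters per the description

# Stand-in for theta produced by a BTM implementation: one row per document,
# each row a distribution over the K topics.
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.full(K, alpha), size=100)  # 100 documents x K topics

def d_btm(i):
    """Topic vector of text i: {p(z_1|d), ..., p(z_K|d)}."""
    return theta[i]
```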
(4) Text vector representation is performed based on the TF-IDF strategy;
(5) The document-topic vectors generated by the BTM model and the document VSM vectors are clustered with weighting by the k-means algorithm;
(6) The clusters produced by the cluster analysis are described, providing a reference for network management or public-opinion monitoring.
The specific implementation process of the short-text topic clustering with the fused BTM model of the invention is as follows:
Step S1: input the data set D to be clustered, and set the number of topics K of the BTM and the cluster-number range [k_min, k_max] of the k-means algorithm;
Step S2: based on the data set D, build the BTM model and generate the document-topic distribution matrix θ and the topic-word distribution matrix φ;
Step S3: based on the data set D, perform text vector representation with the TF-IDF strategy to obtain d_vsm;
Step S4: initialize the marker bit k̂ = k_min and set k = k_min;
Step S5: if k > k_max, execute step S8; otherwise execute step S6;
Step S6: randomly select k initial cluster centers from the data set D, perform k-clustering with the k-means algorithm, and compute the clustering quality J(k) of the current k from the obtained clustering result;
Let J(k̂) be the clustering quality corresponding to the marker bit k̂; if J(k) < J(k̂), update k̂ = k and then execute step S7; otherwise execute step S7 directly (the marker bit k̂ remains unchanged);
Here, the clustering criterion used during clustering is the weighted sum of the first distance and the second distance;
The clustering quality J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) denotes the inter-class distance;
The intra-class distance is the minimum, over all texts, of the average distance of a text to the other texts in its cluster, i.e.:
I(k) = min over clusters C_j and texts x_i ∈ C_j of (1/(|C_j| − 1)) · Σ_{x_p ∈ C_j, x_p ≠ x_i} D(x_i, x_p),
where |C_j| denotes the number of texts in cluster C_j, and x_i, x_p denote text objects belonging to cluster C_j.
The inter-class distance is the distance between the two nearest texts belonging to different clusters, i.e.:
B(k) = min_{i ≠ j} min_{x_p ∈ C_i, x_q ∈ C_j} D(x_p, x_q),
where i = 1, 2, ..., k; j = 1, 2, ..., k; i ≠ j; and x_p, x_q denote text objects in clusters C_i and C_j respectively.
Step S7: update k = k + 1 and continue with step S5;
Step S8: perform k-clustering with the k-means algorithm, where k = k̂.
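For illustration, a Python sketch of steps S4-S8 under the definitions above: clustering_quality computes J(k) = I(k)/B(k), and best_k_clustering scans the cluster-number range; run_kmeans is an assumed helper (k-means over the fused distance, which the patent delegates to the standard algorithm) returning clusters as lists of vectors:

```python
def clustering_quality(clusters, dist):
    """J(k) = I(k) / B(k). I(k): minimum over all texts of the average
    distance to the other texts of the same cluster; B(k): distance
    between the two nearest texts lying in different clusters."""
    intra = float("inf")
    for C in clusters:
        for i, x in enumerate(C):
            if len(C) > 1:
                avg = sum(dist(x, y) for j, y in enumerate(C) if j != i) / (len(C) - 1)
                intra = min(intra, avg)
    inter = float("inf")
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            for x in clusters[a]:
                for y in clusters[b]:
                    inter = min(inter, dist(x, y))
    return intra / inter

def best_k_clustering(data, k_min, k_max, run_kmeans, dist):
    """Steps S4-S8: try every k in [k_min, k_max], keep the k with the
    smallest J(k), then recluster once with that k."""
    k_hat, best_j = k_min, float("inf")
    for k in range(k_min, k_max + 1):      # S5/S7 loop over k
        clusters = run_kmeans(data, k)     # S6: k-means under the fused distance
        j = clustering_quality(clusters, dist)
        if j < best_j:                     # smaller I(k)/B(k) is better
            k_hat, best_j = k, j
    return run_kmeans(data, k_hat)         # S8: final clustering with k = k_hat
```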
Embodiment
The experimental data come from Sina Weibo, and modeling and cluster analysis were carried out on these data with the algorithm of the invention. The experimental part mainly compares the weighted clustering algorithm fusing the full-text VSM and the BTM against algorithms using BTM, VSM and LDA alone.
The experimental data set (corpus) consists of open-source data from Sina Weibo of May 2014, containing 6 hot topics (housing prices, civil servants, Korean TV dramas, smog, GMO, mobile phones) with 1000 items per category and category labels already provided; vocabulary size: 9372 words.
Before the experiments, the data are first preprocessed: the texts are segmented into numbered words, and the BTM additionally requires the word pairs (biterms) to be numbered. The word-segmentation numbering result is shown in Fig. 4, and the biterm numbering in Fig. 5.
After the data preprocessing is completed, the documents are vectorized based on the TF-IDF algorithm to form the VSM model; the document representation result is shown in Fig. 6.
(1) Comparison of the topic clustering effects of BTM and LDA:
The BTM is an unsupervised model that requires no human intervention during operation, but one drawback is that the overall number of topic clusters K of the document set must be preset manually before modeling; whether the setting of K fits the actual data affects the clustering performance of the model.
In this experiment the document set is known in advance to have 6 categories, i.e. 6 topics, and the optimal number of topics K of the BTM and LDA topic models is verified separately. To determine the optimal K, the number of topics is set to 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 in turn, with 1000 modeling iterations, α = 50/K and β = 0.01. After model training, clustering is performed with the k-means algorithm with cluster number k = 6. Since this clustering algorithm easily falls into local optima, each clustering experiment is repeated 10 times and the average F value of the clustering results (considering both precision and recall) is used for assessment. The optimal number of topics K determined by the experiment is used in the subsequent experiments. The text-topic matrix of the BTM model is shown in Fig. 7.
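The patent does not give the exact F-value formula; a sketch of one common clustering F-measure (for each true class, the best F over all clusters, weighted by class size) matching the description "considering accuracy rate and recall rate simultaneously":

```python
def clustering_f_value(labels_true, labels_pred):
    """Weighted clustering F-measure against known class labels."""
    n = len(labels_true)
    total = 0.0
    for c in set(labels_true):
        members = {i for i, t in enumerate(labels_true) if t == c}
        best_f = 0.0
        for g in set(labels_pred):
            cluster = {i for i, p in enumerate(labels_pred) if p == g}
            overlap = len(members & cluster)
            if overlap:
                precision = overlap / len(cluster)
                recall = overlap / len(members)
                best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += len(members) / n * best_f  # weight by class size
    return total
```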
From the experimental results in Table 1 and Fig. 8 it can be seen that, in the verification of the number of clustering topics K under the known categories of the document set, the BTM model achieves its best topic clustering effect at K = 6 and the LDA model at K = 10; overall, BTM outperforms LDA.
As the number of topics K increases, the effect of both models declines to varying degrees. This shows that in the BTM model, with the number of word pairs unchanged, a preset number of topics that deviates too far from the true number causes the original topic probabilities to be over-subdivided, changing the document-topic distribution and ultimately changing the document clustering result derived from that probability distribution. In the LDA model, the modeling effect is poor overall because the model is affected by the sparsity of short-text features, leading to an inaccurate document-topic distribution and thus a poor clustering effect.
Table 1: Comparison of the F values of the BTM and LDA models under different K values
(2) Setting of the fusion coefficient λ:
Based on the above experimental results, this experiment takes BTM topic number K = 6 and tests each k-means cluster number k ∈ [5, 15] separately, with the final overall result expressed as the mean. In the experiment, λ is set to {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} and the topic clustering precision is verified for each value. The results show that the overall clustering result is relatively good at λ = 0.8, so λ = 0.8 is used in the subsequent experiments. The experimental results are shown in Fig. 9.
(3) Determination of the optimal number of clusters:
In practical situations the category structure of the documents is unknown, so this experiment evaluates the clustering effect based on the intra-class/inter-class distance without presetting the number of clusters. With the fused BTM topic number K = 6 and λ = 0.8, clustering is performed over the input cluster range [5, 15]. Fig. 10 is the line chart of the intra-class to inter-class distance ratio obtained under different cluster numbers. The results show that when the cluster number is 6, the intra-class to inter-class distance ratio is the smallest and the clustering effect is the best. This result is consistent with the characteristics of the data set, demonstrating the validity of measuring the clustering result by the intra-class/inter-class distance. Therefore a cluster number of 6 is used in the subsequent experiments.
(4) Comparison of the clustering effects of the models:
This experiment compares the clustering effects based on the 6 categories of the document set. The compared models are the fused BTM model, the BTM model, the LDA model and the VSM model. The topic number K of the BTM model is set to 6, the LDA topic number K to 10, and the k-means cluster number to 6. In the experimental results, the precision P and the F value are used to evaluate the clustering effect: the better the clustering effect of an algorithm, the larger the corresponding precision P; the F value considers precision and recall simultaneously, so the clustering effect is likewise proportional to the F value. The clustering effects of the models are shown in the following table:
Table 2: Clustering precision P of each model
Fig. 11 gives the comparison curves of the clustering precision of each model corresponding to Table 2; as can be seen from the figure, the precision performance of the invention is the best.
Table 3: Comparison of the F value of each model
Fig. 12 gives the comparison curves of the clustering F value of each model corresponding to Table 3; as can be seen from the figure, the advantage of the invention is the most obvious.
The above description is merely a specific embodiment. Any feature disclosed in this specification, unless specifically stated, may be replaced by other equivalent or similarly purposed alternative features; and all of the disclosed features, or all of the method or process steps, may be combined in any way, except for mutually exclusive features and/or steps.

Claims (3)

1. A short-text topic clustering method based on a fused BTM model, characterized in that it comprises the following steps:
Step S1: performing text preprocessing on the short texts to be clustered to obtain a data set D;
Step S2: based on the data set D and a preset number of topics K, building the BTM model and generating a document-topic distribution matrix and a topic-word distribution matrix;
and representing, based on the document-topic distribution matrix, the text vector of any text i in the data set D, denoted d_i_btm;
Step S3: based on the data set D, representing any text i in the data set D as a text vector with the TF-IDF strategy, denoted d_i_vsm;
Step S4: initializing a marker bit k̂ = k_min and setting k = k_min, where k_min denotes the lower limit of the cluster number of the k-means algorithm;
Step S5: if k > k_max, executing step S8; otherwise executing step S6; where k_max denotes the upper limit of the cluster number of the k-means algorithm;
Step S6: randomly selecting k text vectors from the data set D as initial cluster centers, performing k-clustering on the data set D with the k-means algorithm, and computing the clustering quality J(k) of the corresponding k value from the obtained clustering result;
letting J(k̂) denote the clustering quality corresponding to the marker bit k̂; if J(k) < J(k̂), updating k̂ = k and then executing step S7; otherwise executing step S7 directly;
wherein the clustering criterion used during clustering is the weighted sum of a first distance and a second distance, the first distance being the JS distance based on the text vectors d_i_btm and the second distance being the cosine distance based on the text vectors d_i_vsm;
the clustering quality J(k) = I(k)/B(k), where I(k) denotes the intra-class distance and B(k) denotes the inter-class distance;
the intra-class distance being the minimum, over all texts, of the average distance of a text to the other texts;
the inter-class distance being the distance between the two nearest texts belonging to different clusters;
Step S7: updating k = k + 1 and continuing with step S5;
Step S8: performing k-clustering with the k-means algorithm, where k = k̂.
2. The method according to claim 1, characterized in that the weight of the first distance is set to 0.8 and the weight of the second distance to 0.2.
3. The method according to claim 1, characterized in that the text preprocessing includes word segmentation and stop-word removal of the short texts.
CN201811546170.4A 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model Pending CN109726394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811546170.4A CN109726394A (en) 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811546170.4A CN109726394A (en) 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model

Publications (1)

Publication Number Publication Date
CN109726394A true CN109726394A (en) 2019-05-07

Family

ID=66296329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811546170.4A Pending CN109726394A (en) 2018-12-18 2018-12-18 Short text Subject Clustering method based on fusion BTM model

Country Status (1)

Country Link
CN (1) CN109726394A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 Keyword extraction method and system based on phrase vectors
CN110941961A (en) * 2019-11-29 2020-03-31 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111523594A (en) * 2020-04-23 2020-08-11 湖州师范学院 Improved KNN fault classification method based on LDA-KMEDOIDS
CN111897952A (en) * 2020-06-10 2020-11-06 中国科学院软件研究所 Sensitive data discovery method for social media
CN112132624A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Medical claims data prediction system
CN113420112A (en) * 2021-06-21 2021-09-21 中国科学院声学研究所 News entity analysis method and device based on unsupervised learning
WO2023159758A1 (en) * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Data enhancement method and apparatus, electronic device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279556A (en) * 2013-06-09 2013-09-04 南方报业传媒集团 Iterative text clustering method based on adaptive subspace learning
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 Sampling acceleration method for Biterm topic models
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 Hot topic discovery method based on BTM and Single-pass
US20180285348A1 (en) * 2016-07-19 2018-10-04 Tencent Technology (Shenzhen) Company Limited Dialog generation method, apparatus, and device, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279556A (en) * 2013-06-09 2013-09-04 南方报业传媒集团 Iterative text clustering method based on adaptive subspace learning
US20180285348A1 (en) * 2016-07-19 2018-10-04 Tencent Technology (Shenzhen) Company Limited Dialog generation method, apparatus, and device, and storage medium
CN106776579A (en) * 2017-01-19 2017-05-31 清华大学 Sampling acceleration method for Biterm topic models
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 Hot topic discovery method based on BTM and Single-pass

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李泽华 (Li Zehua): "Design and Implementation of a Web Log Mining System Based on Short Texts", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 Keyword extraction method and system based on phrase vectors
CN110941961A (en) * 2019-11-29 2020-03-31 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN110941961B (en) * 2019-11-29 2023-08-25 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN111191036A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Short text topic clustering method, device, equipment and medium
CN111523594A (en) * 2020-04-23 2020-08-11 湖州师范学院 Improved KNN fault classification method based on LDA-KMEDOIDS
CN111897952A (en) * 2020-06-10 2020-11-06 中国科学院软件研究所 Sensitive data discovery method for social media
CN111897952B (en) * 2020-06-10 2022-10-14 中国科学院软件研究所 Sensitive data discovery method for social media
CN112132624A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Medical claims data prediction system
CN113420112A (en) * 2021-06-21 2021-09-21 中国科学院声学研究所 News entity analysis method and device based on unsupervised learning
WO2023159758A1 (en) * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Data enhancement method and apparatus, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
CN109726394A (en) Short text Subject Clustering method based on fusion BTM model
CN107330049B (en) News popularity estimation method and system
CN104899304B (en) Named entity recognition method and device
CN107992542A (en) Similar-article recommendation method based on a topic model
CN103970866B (en) Method and system for discovering microblog user interests based on microblog text
CN105550211A (en) Social network and item content integrated collaborative recommendation system
CN107526819A (en) Big-data public opinion analysis method oriented to a short-text topic model
WO2023029356A1 (en) Sentence embedding generation method and apparatus based on sentence embedding model, and computer device
CN109492678A (en) App classification method integrating shallow and deep learning
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
Ye et al. A web services classification method based on GCN
CN108664558A (en) Personalized web TV recommendation method oriented to large-scale users
WO2020147259A1 (en) User portrait method and apparatus, readable storage medium, and terminal device
JP2020098592A (en) Method, device and storage medium of extracting web page content
CN117034921B (en) Prompt learning training method, device and medium based on user data
Chen et al. Popular topic detection in Chinese micro-blog based on the modified LDA model
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
Sun et al. Rumour detection technology based on the BiGRU_capsule network
CN108694176A (en) Method, apparatus, electronic device and readable storage medium for document sentiment analysis
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
Wang Research on the art value and application of art creation based on the emotion analysis of art
Zhou et al. The improved grey model by fusing exponential buffer operator and its application
CN110413782A (en) Automatic table topic classification method, device, computer equipment and storage medium
CN112434126A (en) Information processing method, device, equipment and storage medium
Lu et al. A novel method for Chinese named entity recognition based on character vector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190507
