CN108228721A - Fast text clustering method on large corpora - Google Patents
- Publication number
- CN108228721A (application CN201711290927.3A)
- Authority
- CN
- China
- Prior art keywords
- document
- cluster
- index
- result
- value
- Prior art date: 2017-12-08
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of relational databases, and specifically provides a fast text clustering method on large corpora. Because text data is usually high-dimensional and sparse, clustering methods based purely on data similarity rarely achieve good results, whereas methods based on generative models, such as the Dirichlet multinomial mixture model, perform noticeably better. The invention exploits the symmetric prior of the Dirichlet distribution and constructs an index for optimization, so that the total running time depends only on the number of distinct words in each document; the method therefore runs efficiently even on longer documents.
Description
Technical field
The invention belongs to the technical field of relational databases, and in particular relates to a fast text clustering method on large corpora.
Background technology
Text clustering is a common problem in data mining and an important means of organizing textual information effectively; it plays an important role in research on natural language processing and related fields.
Because text data consists only of words, it is usually of higher dimensionality and sparser than data described by other extracted features, so clustering methods based purely on data similarity rarely achieve good results, whereas methods based on generative models, such as the Dirichlet multinomial mixture model, perform noticeably better.
However, the time taken by the Dirichlet multinomial mixture model is proportional to document length. In a large corpus the documents are often long, so the convergence rate is unsatisfactory and overall data-processing efficiency suffers.
Summary of the invention
The purpose of the present invention is to propose a method for fast text clustering on large corpora, so as to facilitate subsequent data processing.
The fast text clustering method on large corpora proposed by the present invention comprises the following steps:
1. Given a text data set D consisting of a large number of documents, first build an index for the subsequent cumulative-product computation, as shown in Fig. 2.
In this index, the value of the i-th element a_i is defined so that, once the index is established, the required cumulative product can be obtained with a single division, reducing the computational complexity from O(n) to O(1); a minimal sketch of such an index follows.
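The exact formula for a_i appears only as a figure (Fig. 2) in the original. A reading consistent with the text, where a ratio of two index entries replaces an O(n) running product, is a prefix-product table a_i = prod_{j=1..i}(V*beta + j - 1). The sketch below is built on that assumption, works in log space to avoid overflow (so the single division becomes a single subtraction), and uses illustrative names and parameter values rather than anything from the patent.

```python
import math

def build_index(max_n, v_beta):
    """Prefix sums of log(v_beta + j - 1) for j = 1..max_n.

    a[i] plays the role of the index entry a_i; storing logarithms avoids
    the overflow that a raw prefix product would hit almost immediately.
    """
    a = [0.0] * (max_n + 1)
    for j in range(1, max_n + 1):
        a[j] = a[j - 1] + math.log(v_beta + j - 1)
    return a

def range_log_product(a, start, length):
    """log of prod_{i=1}^{length} (start + v_beta + i - 1).

    With the index in place this is a single subtraction (a single division
    in the non-log formulation), i.e. O(1) instead of O(length).
    """
    return a[start + length] - a[start]

# Example: a denominator-style term for a cluster currently holding 1000 words
# and a document of 250 words, with an assumed V = 50_000 and beta = 0.1.
index = build_index(100_000, 50_000 * 0.1)
log_term = range_log_product(index, start=1000, length=250)
```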
2. The hyperparameters α, β and the total number of classes K for the clustering process are provided by the user. A Dirichlet multinomial mixture model based on Gibbs sampling is then used to infer the class label of each document. The detailed procedure is:
2.1. For each document in the corpus, randomly assign it a class label z_i;
2.2. Traverse all documents; for each document i, according to the current class assignments of the other documents in the corpus and the Dirichlet posterior distribution formula, sample an updated class for document i. The distribution it obeys, its derivation, the symbol definitions, and the simplified result are as shown in the formula reproduced after this step.
After the index optimization, the computational complexity of the denominator drops to O(1), while the computational complexity of the numerator is proportional to the number of distinct words in the document;
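The formulas themselves are not reproduced in this text. For reference, the standard collapsed Gibbs conditional of the Dirichlet multinomial mixture from the literature, which matches the complexity remarks above (a per-cluster denominator of N_d factors and a numerator ranging over the distinct words of the document) but is not necessarily the patent's exact expression, is:

$$
p(z_d = k \mid \vec z_{\neg d}, D) \;\propto\;
\frac{m_{k,\neg d} + \alpha}{|D| - 1 + K\alpha}\cdot
\frac{\prod_{w \in d}\prod_{j=1}^{N_d^{w}}\bigl(n_{k,\neg d}^{w} + \beta + j - 1\bigr)}
     {\prod_{i=1}^{N_d}\bigl(n_{k,\neg d} + V\beta + i - 1\bigr)}
$$

where m_k is the number of documents currently in cluster k, n_k^w the count of word w in cluster k, n_k the total word count of cluster k, N_d^w the count of w in document d, N_d the length of document d, V the vocabulary size, and the subscript ¬d excludes document d from the counts.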
2.3. For the target distribution p(x) to be sampled, choose a proposal distribution q(x) that is easier to sample from and satisfies the following property: if at step i a Markov chain is constructed with transition probability q(x_i | x_{i-1}), then after sufficiently many steps the distribution over the chain's states converges to p(x);
2.4. Sample an initial value x_0 ~ q(x);
2.5. Sample x_cand ~ q(x_cand | x_{i-1}) and compute the acceptance probability:
Accept this sampled result with the above probability, i.e. set x_i = x_cand; if it is not accepted, set x_i = x_{i-1};
During sampling, because the proposal distribution is reused across n rounds of iteration, when n is large the alias method can be used to amortize the time complexity of drawing n samples from O(K) to O(1) per draw (a sketch of the alias construction is given below);
likewise, when the cluster label of a document does not change between two rounds of iteration, x_cand = x_i and the acceptance probability need not be computed, which further speeds up the sampling process;
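A minimal sketch of the alias construction mentioned above, using the generic Walker/Vose method rather than anything specific to the patent; build_alias_table and alias_draw are illustrative names. The table is built once in O(K) from the reused proposal weights, after which every draw costs O(1).

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) preprocessing of a fixed categorical distribution."""
    k = len(weights)
    total = sum(weights)
    prob = [w * k / total for w in weights]          # rescaled so the mean is 1
    alias = [0] * k
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]                     # move excess mass of l onto s
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:                          # leftovers are exactly 1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """One O(1) draw from the preprocessed distribution."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Reusing one table across many draws is where the amortized O(1) pays off.
prob, alias = build_alias_table([0.5, 0.2, 0.2, 0.1])
labels = [alias_draw(prob, alias) for _ in range(10_000)]
```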
2.6. Repeat step 2.5 a predetermined number of times;
2.7. Return the current x_i as the sampling result;
2.8. Repeat steps 2.2-2.7 until convergence (steps 2.3-2.8 are sketched below);
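A minimal sketch of the Metropolis-Hastings loop in steps 2.3-2.8, assuming an independence proposal over the K cluster labels (which is what makes the alias table above reusable) and including the shortcut of skipping the acceptance computation when the candidate equals the current state; target_logp, proposal_probs, and the toy usage at the bottom are placeholders, not the patent's implementation.

```python
import math
import random

def mh_sample_label(target_logp, proposal_probs, x_prev, n_steps, rng=random):
    """Metropolis-Hastings over K cluster labels with an independence proposal q.

    target_logp(k)    -- unnormalised log target, e.g. the DMM conditional;
    proposal_probs[k] -- probability of label k under q (reused across rounds,
                         so in practice this draw would come from an alias table).
    """
    candidates = range(len(proposal_probs))
    x = x_prev
    for _ in range(n_steps):
        cand = rng.choices(candidates, weights=proposal_probs)[0]   # x_cand ~ q
        if cand == x:
            continue        # x_cand == x_i: no acceptance computation needed
        # independence-proposal acceptance: min(1, p(cand) q(x) / (p(x) q(cand)))
        log_ratio = (target_logp(cand) - target_logp(x)
                     + math.log(proposal_probs[x]) - math.log(proposal_probs[cand]))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = cand                                                # accept
        # else keep x, i.e. x_i = x_{i-1}
    return x

# Illustrative usage with a toy 4-cluster target and a uniform proposal.
toy_logp = lambda k: math.log((0.1, 0.6, 0.2, 0.1)[k])
label = mh_sample_label(toy_logp, proposal_probs=[0.25] * 4, x_prev=0, n_steps=30)
```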
As an optimization, during the sampling process, for computations involving products of factors of the form f(n_kw), if only a small fraction of the f(n_kw) values change between two rounds of iteration, the computation can be carried out only for the w whose n_kw changed, reducing the time complexity of the sampling process (a sketch follows). As the threshold for "a small fraction", 20% is generally used and can be adjusted to the actual situation; in theory the optimization still helps below 50%, so the threshold may range from 20% to 50%;
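A sketch of this incremental recomputation, under the assumption that the quantity of interest is a per-cluster product of f(n_kw) over words w (the exact expression is not reproduced in this text) and that the cached value is kept in log space; f, the counts dictionaries, and the threshold comment are illustrative.

```python
import math

def f(n_kw, beta=0.1):
    # illustrative factor; the patent's exact f(n_kw) is not reproduced in the text
    return n_kw + beta

def update_log_product(cached_log_prod, old_counts, new_counts, changed_words):
    """Correct a cached sum of log f(n_kw) only for the words w whose count n_kw
    changed between two rounds of iteration (counts of 0 are kept as explicit
    entries so the old and new products range over the same words)."""
    for w in changed_words:
        cached_log_prod += math.log(f(new_counts[w])) - math.log(f(old_counts[w]))
    return cached_log_prod

# If more than roughly 20%-50% of the factors changed, recomputing from scratch
# is cheaper; below that, the incremental correction above is the faster path.
```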
2.9. Output the clustering result according to the class label assigned to each document.
By exploiting the symmetric prior of the Dirichlet distribution and constructing an index for optimization, the present invention makes the total running time depend only on the number of distinct words in each document, so the method runs efficiently even on longer documents.
Description of the drawings
Fig. 1 is the graphical model of the Dirichlet multinomial mixture used in this method.
Fig. 2 shows the process of building the index sequentially.
Specific embodiment
For convenience of description, the index-optimized fast text clustering method is hereinafter referred to as IGSDMM. The advantage of the present invention over existing clustering algorithms is illustrated on two datasets, described as follows:
NG20. This dataset contains 18,846 documents from 20 mainstream Western newsgroups and is a classic benchmark for evaluating text clustering algorithms. The average document length in NG20 is 137.85 words, and the average number of distinct words per document is 91.
Tweet. This dataset consists of 2,472 tweets related to 89 queries; the relevance between tweets and queries was manually annotated. The average tweet length is 8.56 words, and the average number of distinct words per tweet is 7.
Normalized mutual information (NMI) is widely used to measure the quality of clustering results. NMI measures the statistical information shared between the random variable representing the cluster assignment and the true class labels of the documents. In its formal definition, n_c is the number of documents in class c, n_k is the number of documents in cluster k, n_{c,k} is the number of documents that are both in class c and in cluster k, and N is the number of documents in the dataset. When the clustering result perfectly matches the true labels, NMI equals 1; when the clustering result is random, NMI is close to 0.
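The NMI formula itself is not reproduced in this text; the standard definition consistent with the symbols above is:

$$
\mathrm{NMI} \;=\; \frac{\sum_{c,k} n_{c,k}\,\log\dfrac{N\, n_{c,k}}{n_c\, n_k}}
{\sqrt{\Bigl(\sum_{c} n_c \log\dfrac{n_c}{N}\Bigr)\Bigl(\sum_{k} n_k \log\dfrac{n_k}{N}\Bigr)}}
$$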
We compare the performance of IGSDMM with K-means and LDA. The operating parameters of IGSDMM are set as follows: α and β are both set to 0.1, and the number of iterations is set to 30. The results are shown in the following table:
In the comparison, K is also set to 0.5, 1, 2, and 3 times the true value of K. To ensure that the randomness of the algorithms does not affect the comparison, each reported value is the average of 10 runs.
The table shows that, on both NG20 and Tweet, IGSDMM outperforms the other two classic clustering algorithms. It can also be observed that, for this algorithm, performance still improves when K is set larger, whereas for K-means a relatively large K setting leads to a decline in performance.
For this algorithm, setting a larger K value not only does not degrade performance but can lead to better performance, because the algorithm automatically infers a suitable number of clusters. It is therefore worthwhile to set a larger number of clusters.
Claims (2)
1. A fast text clustering method on a large corpus, characterized in that the steps are as follows:
(1) Given a text data set D consisting of a large number of documents, first build an index for the subsequent cumulative-product computation;
in this index, the value of the i-th element a_i is defined such that, once the index is established, the required cumulative product can be obtained with a single division;
(2) The hyperparameters α, β and the total number of classes K for the clustering process are provided by the user; a Dirichlet multinomial mixture model based on Gibbs sampling is used to infer the class label of each document; the detailed procedure is:
(2.1) For each document in the corpus, randomly assign it a class label z_i;
(2.2) Traverse all documents; for each document i, according to the current class assignments of the other documents in the corpus and the Dirichlet posterior distribution formula, sample an updated class for document i; the distribution it obeys, and the result of simplifying the distribution formula, are as follows:
(2.3) For the target distribution p(x) to be sampled, choose a proposal distribution q(x) that is easier to sample from and satisfies the following property: if at step i a Markov chain is constructed with transition probability q(x_i|x_{i-1}), then after sufficiently many steps the distribution over the chain's states converges to p(x);
(2.4) Sample an initial value x_0 ~ q(x);
(2.5) Sample x_cand ~ q(x_cand|x_{i-1}) and compute the acceptance probability; accept this sample with that probability, i.e. set x_i = x_cand, and otherwise set x_i = x_{i-1};
(2.6) Repeat step (2.5) a predetermined number of times;
(2.7) Return the current x_i as the sampling result;
(2.8) Repeat steps (2.2)-(2.7) until convergence;
(2.9) Output the clustering result according to the class label assigned to each document;
The symbols used in the formulas and their meanings are as follows:
2. The method according to claim 1, characterized in that, during sampling, for computations involving factors of the form f(n_kw), if only a small fraction of the f(n_kw) values change between two rounds of iteration, the computation is carried out only for the w whose n_kw changed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711290927.3A CN108228721B (en) | 2017-12-08 | 2017-12-08 | Fast text clustering method on large corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108228721A true CN108228721A (en) | 2018-06-29 |
CN108228721B CN108228721B (en) | 2021-06-04 |
Family
ID=62653406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711290927.3A Active CN108228721B (en) | 2017-12-08 | 2017-12-08 | Fast text clustering method on large corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108228721B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101582080A (en) * | 2009-06-22 | 2009-11-18 | 浙江大学 | Web image clustering method based on image and text relevant mining |
CN102831119A (en) * | 2011-06-15 | 2012-12-19 | 日电(中国)有限公司 | Short text clustering equipment and short text clustering method |
US20150039617A1 (en) * | 2013-08-01 | 2015-02-05 | International Business Machines Corporation | Estimating data topics of computers using external text content and usage information of the users |
CN103714171A (en) * | 2013-12-31 | 2014-04-09 | 深圳先进技术研究院 | Document clustering method |
CN103870840A (en) * | 2014-03-11 | 2014-06-18 | 西安电子科技大学 | Improved latent Dirichlet allocation-based natural image classification method |
Non-Patent Citations (1)
Title |
---|
KANG Tiegang et al., "A Word Clustering Method Based on a Large-Scale Annotated Corpus", Journal of System Simulation * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109829164A (en) * | 2019-02-01 | 2019-05-31 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating text |
Also Published As
Publication number | Publication date |
---|---|
CN108228721B (en) | 2021-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cai et al. | Deeplearning model used in text classification | |
Carpenter | LingPipe for 99.99% recall of gene mentions | |
CN102043851A (en) | Multiple-document automatic abstracting method based on frequent itemset | |
CN109508374B (en) | Text data semi-supervised clustering method based on genetic algorithm | |
CN107633000B (en) | Text classification method based on tfidf algorithm and related word weight correction | |
CN107066555A (en) | Towards the online topic detection method of professional domain | |
CN109993216B (en) | Text classification method and device based on K nearest neighbor KNN | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN102243641A (en) | Method for efficiently clustering massive data | |
CN112347246B (en) | Self-adaptive document clustering method and system based on spectrum decomposition | |
CN107992549B (en) | Dynamic short text stream clustering retrieval method | |
Matusevych et al. | Hokusai-sketching streams in real time | |
CN111325033B (en) | Entity identification method, entity identification device, electronic equipment and computer readable storage medium | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN105760875A (en) | Binary image feature similarity discrimination method based on random forest algorithm | |
Li et al. | Quantization algorithms for random fourier features | |
CN108228721A (en) | Fast text clustering method on large corpora | |
CN112182337B (en) | Method for identifying similar news from massive short news and related equipment | |
Sun et al. | Chinese microblog sentiment classification based on convolution neural network with content extension method | |
CN111651660A (en) | Method for cross-media retrieval of difficult samples | |
CN111091001A (en) | Method, device and equipment for generating word vector of word | |
Chadha et al. | Differentially Private Heavy Hitter Detection using Federated Analytics | |
CN109902169B (en) | Method for improving performance of film recommendation system based on film subtitle information | |
Graham et al. | Small sample methods | |
Verma et al. | Variance reduction in feature hashing using MLE and control variate method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||