CN108228721A - Fast text clustering method on large corpora - Google Patents

Fast text clustering method on large corpora

Info

Publication number
CN108228721A
CN108228721A
Authority
CN
China
Prior art keywords
document
cluster
index
result
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711290927.3A
Other languages
Chinese (zh)
Other versions
CN108228721B (en)
Inventor
李林蔚
郭良琛
马会心
何震瀛
荆楠
荆一楠
王晓阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201711290927.3A priority Critical patent/CN108228721B/en
Publication of CN108228721A publication Critical patent/CN108228721A/en
Application granted granted Critical
Publication of CN108228721B publication Critical patent/CN108228721B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of relational database technology, and specifically provides a fast text clustering method for large corpora. Because text data is typically high-dimensional and sparse, clustering methods based purely on data similarity struggle to achieve good results, whereas generative-model approaches such as the Dirichlet multinomial mixture model perform noticeably better. By exploiting the symmetric prior of the Dirichlet distribution and constructing an index to optimize computation, the invention makes the total running time depend only on the number of distinct words in each document, so the method runs efficiently even on long documents.

Description

Fast text clustering method on large corpora
Technical field
The invention belongs to the field of relational database technology, and specifically relates to a fast text clustering method for large corpora.
Background technology
Text cluster is a kind of FAQs in data mining, is the important hand effectively organized to text message Section, plays an important role in the research of natural language processing etc..
Since text data is only made of word, with it is other by extraction characteristics compared with, usual dimension higher and More sparse, the clustering method for being based purely on data similarity is difficult to obtain preferable effect, and based on the method for generation model As the multinomial mixed model of Di Li Crays is phenomenologically more prominent.
However the multinomial mixed model the time it takes of Di Li Crays is directly proportional to Document Length, for large corpora Speech, document therein is often larger, causes convergence rate not ideal enough, has influenced whole data-handling efficiency.
Summary of the invention
The purpose of the present invention is to propose a fast text clustering method for large corpora, so as to facilitate subsequent data processing.
The fast text clustering method on large corpora proposed by the present invention proceeds as follows:
1. Given a text data set D composed of large documents, first build an index for the subsequent cumulative-product computations, as shown in Fig. 2.
In the index, the i-th element a_i stores a running prefix product (the exact formula appears as an image in the source). Once such an index is built, each required product value can be computed by a single division of two index entries, so the computational complexity of the product drops from O(n) to O(1).
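As an illustration, here is a minimal Python sketch of such a prefix-product index; the function name and the choice of offset constant are assumptions for illustration, not the patent's reference implementation:

```python
import numpy as np

def build_prefix_index(c: float, n: int) -> np.ndarray:
    """Prefix-product index: a[0] = 1 and a[i] = a[i-1] * (c + i - 1).

    Any range product prod_{j=m..n} (c + j - 1) then reduces to the
    single division a[n] / a[m-1], i.e. O(1) instead of O(n).
    """
    a = np.empty(n + 1)
    a[0] = 1.0
    for i in range(1, n + 1):
        a[i] = a[i - 1] * (c + i - 1)
    return a

# Example: evaluate prod_{j=100..150} (c + j - 1) with one division.
a = build_prefix_index(c=500.0, n=200)   # e.g. c could stand for V * beta
range_product = a[150] / a[99]
```

In practice the running products grow quickly, so a log-space variant (storing cumulative sums of logarithms and subtracting instead of dividing) is the numerically safe choice; the O(1)-lookup argument is unchanged.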
2. The hyperparameters α and β and the total number of clusters K are supplied by the user. A Dirichlet multinomial mixture model based on Gibbs sampling is then used to infer the cluster label of each document, as follows:
2.1. For each document in the corpus, randomly assign it a cluster label z_i.
2.2. Traverse all documents; for each document i, according to the current cluster assignments of the other documents in the corpus, sample an updated cluster label for document i according to the Dirichlet posterior distribution formula; the distribution obeyed is:
The derivation of the distribution formula, together with the symbols used and their meanings, is given in the original equations (rendered as images in the source). After derivation and simplification, the distribution takes the following form:
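The equation images did not survive extraction; as a hedged reconstruction, the simplified Gibbs-sampling conditional of the Dirichlet multinomial mixture model (in the standard GSDMM form) reads:

$$P(z_d = k \mid \vec{z}_{\neg d}, D) \;\propto\; \frac{m_{k,\neg d} + \alpha}{|D| - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left(n_{k,\neg d}^w + \beta + j - 1\right)}{\prod_{i=1}^{N_d} \left(n_{k,\neg d} + V\beta + i - 1\right)}$$

Here m_{k,¬d} is the number of documents in cluster k excluding document d, N_d^w is the count of word w in document d, N_d is the length of d, n_{k,¬d}^w is the count of w in cluster k excluding d, n_{k,¬d} is the total word count of cluster k excluding d, and V is the vocabulary size. The numerator iterates only over the distinct words of d, matching the statement below about its cost, and the denominator is exactly the range product that the index of step 1 reduces to a single division.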
After the index optimization, the computational complexity of the denominator drops to O(1), while the complexity of the numerator is proportional to the number of distinct words in the document;
2.3. For the target distribution p(x) to be sampled, choose a proposal distribution q(x) that is easier to sample from and satisfies the following property: if a Markov chain is constructed whose i-th step uses transition probability q(x_i | x_{i-1}), then after sufficiently many steps the distribution over states converges to p(x);
2.4. Sample an initial value x_0 ~ q(x);
2.5. Sample x_cand ~ q(x_cand | x_{i-1}) and compute the acceptance probability:
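The acceptance-probability formula is likewise an image in the source; for the Metropolis-Hastings construction described here it takes the standard form:

$$\alpha(x_{\mathrm{cand}} \mid x_{i-1}) = \min\!\left(1,\; \frac{p(x_{\mathrm{cand}})\, q(x_{i-1} \mid x_{\mathrm{cand}})}{p(x_{i-1})\, q(x_{\mathrm{cand}} \mid x_{i-1})}\right)$$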
Accept this sampled candidate with the above probability, i.e., set x_i = x_cand; if it is not accepted, set x_i = x_{i-1};
During sampling, the proposal distribution is reused across n rounds of iteration. When n is large, the alias method can therefore be used to amortize the time complexity of drawing n samples from O(K) down to O(1) per sample, as sketched below;
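A minimal sketch of the alias method (Vose's construction; all names are illustrative): the table costs O(K) to build once, after which every draw from the fixed K-way proposal costs O(1), so n draws amortize to O(1) each.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(K) preprocessing for O(1) sampling."""
    k = len(probs)
    scaled = [p * k for p in probs]
    prob, alias = [0.0] * k, [0] * k
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]          # donate mass to the small bucket
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                   # numerical leftovers fill to 1
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """Draw one sample in O(1): pick a bucket, then a biased coin flip."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Reuse one table for many draws from a fixed K-way proposal.
prob, alias = build_alias_table([0.1, 0.2, 0.3, 0.4])
samples = [alias_draw(prob, alias) for _ in range(10)]
```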
Likewise, when the cluster label of a document does not change between two rounds of iteration, x_cand = x_i, so the acceptance probability need not be computed at all, which further accelerates the sampling process;
2.6. Repeat step 2.5 a predetermined number of times;
2.7. Return the current x_i as the sampling result;
2.8. Repeat steps 2.2-2.7 until convergence;
As a further optimization, for computations of the form of a product of per-word factors f(n_kw), if only a small fraction of the f(n_kw) values change between two rounds of iteration, the computation can be restricted to the words w whose n_kw values changed, reducing the time complexity of the sampling process. "A small fraction" generally means 20% as a threshold, adjustable to actual conditions; the optimization is theoretically effective below 50%, so the usable range is 20%-50%;
2.9. Output the clustering result using the cluster label assigned to each document.
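Putting steps 2.1, 2.2, and 2.9 together, the following is a minimal sketch of the plain collapsed Gibbs loop, without the Metropolis-Hastings, alias-table, or index accelerations of steps 2.3-2.8; all names are illustrative rather than the patent's reference implementation:

```python
import math
import random
from collections import Counter

def gsdmm(docs, V, K, alpha=0.1, beta=0.1, iters=30):
    """docs: list of Counter(word -> count). Returns one cluster label per document."""
    z = [random.randrange(K) for _ in docs]       # step 2.1: random initial labels
    m = [0] * K                                   # documents currently in each cluster
    n = [0] * K                                   # total words currently in each cluster
    nw = [Counter() for _ in range(K)]            # per-cluster word counts
    for d, doc in enumerate(docs):
        m[z[d]] += 1
        n[z[d]] += sum(doc.values())
        nw[z[d]].update(doc)
    for _ in range(iters):                        # step 2.8: iterate to convergence
        for d, doc in enumerate(docs):            # step 2.2: resample each document
            Nd, k_old = sum(doc.values()), z[d]
            m[k_old] -= 1; n[k_old] -= Nd; nw[k_old].subtract(doc)
            logp = []                             # log-space for numerical stability
            for k in range(K):
                lp = math.log(m[k] + alpha)
                for w, c in doc.items():          # numerator: distinct words only
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(Nd):               # denominator: the range product
                    lp -= math.log(n[k] + V * beta + i)
                logp.append(lp)
            mx = max(logp)
            weights = [math.exp(l - mx) for l in logp]
            r, acc, k_new = random.uniform(0.0, sum(weights)), 0.0, K - 1
            for k, wgt in enumerate(weights):
                acc += wgt
                if r <= acc:
                    k_new = k
                    break
            z[d] = k_new
            m[k_new] += 1; n[k_new] += Nd; nw[k_new].update(doc)
    return z                                      # step 2.9: labels are the result
```

The two inner loops are exactly the products discussed above: the position loop over Nd (the denominator) is what the prefix index of step 1 collapses to a single division, and the occurrence loop touches only the distinct words of the document.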
By exploiting the symmetric prior of the Dirichlet distribution and constructing an index for optimization, the present invention makes the total running time depend only on the number of distinct words in each document, so it runs efficiently even on long documents.
Description of the drawings
Fig. 1 is the graphical model of the Dirichlet multinomial mixture used in this method.
Fig. 2 shows the process of sequentially building the index.
Specific embodiments
For convenience of description, the index-optimized fast text clustering method is hereinafter referred to as IGSDMM.
The advantage of the present invention over existing clustering algorithms is illustrated on two data sets, described as follows:
NG20. This data set contains 18,846 documents from 20 mainstream Western newsgroups and is a classic benchmark for evaluating text clustering algorithms. The average document length in NG20 is 137.85 words, and the average vocabulary size per document is 91.
Tweet. This data set consists of 2,472 tweets related to 89 queries; the correspondence between tweets and queries was labeled manually. The average tweet length is 8.56 words, and the average vocabulary size per document is 7.
Normalized mutual information (NMI) is widely used to measure the quality of clustering results. NMI measures the statistical information shared between the random variable representing the cluster assignment and the one representing the true class label of a document. Its formal definition is given below.
In that definition, n_c is the number of documents in class c, n_k is the number of documents in cluster k, n_{c,k} is the number of documents that are both in class c and in cluster k, and N is the total number of documents in the data set. When the clustering result perfectly matches the true labels, NMI equals exactly 1; when the clustering result is random, NMI approaches 0.
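The NMI formula appears as an image in the source; in the notation just defined, the standard form is:

$$\mathrm{NMI} = \frac{\sum_{c}\sum_{k} n_{c,k}\,\log\frac{N\, n_{c,k}}{n_c\, n_k}}{\sqrt{\left(\sum_{c} n_c \log\frac{n_c}{N}\right)\left(\sum_{k} n_k \log\frac{n_k}{N}\right)}}$$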
We compared the performance of IGSDMM with K-means and LDA. The operating parameters of IGSDMM were set as follows: α and β were both set to 0.1, and the number of iterations was set to 30. The results are reported in the table of the original publication (rendered as an image in the source).
In the comparison, K was also varied, taking 0.5, 1, 2, and 3 times the true number of classes. To keep the algorithms' randomness from affecting the comparison, each reported value is the mean over 10 runs.
The table shows that on both NG20 and Tweet, IGSDMM outperforms the two other classical clustering algorithms. It can also be observed that the performance of IGSDMM improves when K is set larger, whereas for K-means a larger K setting leads to a decline in performance.
For IGSDMM, setting a larger K value does not degrade performance and can even yield better performance, because the algorithm automatically infers a suitable number of clusters. Setting a larger number of clusters is therefore worthwhile.

Claims (2)

1. A method for fast text clustering on a large corpus, characterized by the following steps:
(1) given a text data set D composed of large documents, first build an index for the subsequent cumulative-product computations;
in the index, the i-th element a_i stores a running prefix product, so that once the index is built, each required product value can be computed by a single division;
(2) the hyperparameters α and β and the total number of clusters K are supplied by the user; a Dirichlet multinomial mixture model based on Gibbs sampling is used to infer the cluster label of each document, as follows:
(2.1) for each document in the corpus, randomly assign it a cluster label z_i;
(2.2) traverse all documents, and according to the current cluster assignments of the other documents in the corpus, sample an updated cluster label for document i according to the Dirichlet posterior distribution formula;
the simplified form of the distribution is given by the equations of the original filing (rendered as images in the source);
(2.3) for the target distribution p(x) to be sampled, choose a proposal distribution q(x) that is easier to sample from and satisfies the following property: if a Markov chain is constructed whose i-th step uses transition probability q(x_i | x_{i-1}), then after sufficiently many steps the distribution over states converges to p(x);
(2.4) sample an initial value x_0 ~ q(x);
(2.5) sample x_cand ~ q(x_cand | x_{i-1}) and compute the acceptance probability;
accept this sample with that probability, i.e., set x_i = x_cand; if it is not accepted, set x_i = x_{i-1};
(2.6) repeat step (2.5) a predetermined number of times;
(2.7) return the current x_i as the sampling result;
(2.8) repeat steps (2.2)-(2.7) until convergence;
(2.9) output the clustering result using the cluster label assigned to each document;
the symbols used in the formulas and their meanings are defined in the equations of the original filing.
2. The method according to claim 1, characterized in that during sampling, for computations of the form of a product of per-word factors f(n_kw), if only a small fraction of the f(n_kw) values change between two rounds of iteration, the computation is performed only for the words w whose n_kw values changed.
CN201711290927.3A 2017-12-08 2017-12-08 Fast text clustering method on large corpus Active CN108228721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711290927.3A CN108228721B (en) 2017-12-08 2017-12-08 Fast text clustering method on large corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711290927.3A CN108228721B (en) 2017-12-08 2017-12-08 Fast text clustering method on large corpus

Publications (2)

Publication Number Publication Date
CN108228721A true CN108228721A (en) 2018-06-29
CN108228721B CN108228721B (en) 2021-06-04

Family

ID=62653406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711290927.3A Active CN108228721B (en) 2017-12-08 2017-12-08 Fast text clustering method on large corpus

Country Status (1)

Country Link
CN (1) CN108228721B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829164A (en) * 2019-02-01 2019-05-31 北京字节跳动网络技术有限公司 Method and apparatus for generating text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582080A (en) * 2009-06-22 2009-11-18 浙江大学 Web image clustering method based on image and text relevant mining
CN102831119A (en) * 2011-06-15 2012-12-19 日电(中国)有限公司 Short text clustering equipment and short text clustering method
CN103714171A (en) * 2013-12-31 2014-04-09 深圳先进技术研究院 Document clustering method
CN103870840A (en) * 2014-03-11 2014-06-18 西安电子科技大学 Improved latent Dirichlet allocation-based natural image classification method
US20150039617A1 (en) * 2013-08-01 2015-02-05 International Business Machines Corporation Estimating data topics of computers using external text content and usage information of the users

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582080A (en) * 2009-06-22 2009-11-18 浙江大学 Web image clustering method based on image and text relevant mining
CN102831119A (en) * 2011-06-15 2012-12-19 日电(中国)有限公司 Short text clustering equipment and short text clustering method
US20150039617A1 (en) * 2013-08-01 2015-02-05 International Business Machines Corporation Estimating data topics of computers using external text content and usage information of the users
CN103714171A (en) * 2013-12-31 2014-04-09 深圳先进技术研究院 Document clustering method
CN103870840A (en) * 2014-03-11 2014-06-18 西安电子科技大学 Improved latent Dirichlet allocation-based natural image classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
康铁钢 et al.: "一种基于大规模标注语料库的词语聚类方法" (A Word Clustering Method Based on a Large-Scale Annotated Corpus), 《系统仿真学报》 (Journal of System Simulation) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829164A (en) * 2019-02-01 2019-05-31 北京字节跳动网络技术有限公司 Method and apparatus for generating text

Also Published As

Publication number Publication date
CN108228721B (en) 2021-06-04

Similar Documents

Publication Publication Date Title
Cai et al. Deeplearning model used in text classification
Carpenter LingPipe for 99.99% recall of gene mentions
CN102043851A (en) Multiple-document automatic abstracting method based on frequent itemset
CN109508374B (en) Text data semi-supervised clustering method based on genetic algorithm
CN107633000B (en) Text classification method based on tfidf algorithm and related word weight correction
CN107066555A (en) Towards the online topic detection method of professional domain
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN107357895B (en) Text representation processing method based on bag-of-words model
CN102243641A (en) Method for efficiently clustering massive data
CN112347246B (en) Self-adaptive document clustering method and system based on spectrum decomposition
CN107992549B (en) Dynamic short text stream clustering retrieval method
Matusevych et al. Hokusai-sketching streams in real time
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN105760875A (en) Binary image feature similarity discrimination method based on random forest algorithm
Li et al. Quantization algorithms for random fourier features
CN108228721A (en) Fast text clustering method on large corpora
CN112182337B (en) Method for identifying similar news from massive short news and related equipment
Sun et al. Chinese microblog sentiment classification based on convolution neural network with content extension method
CN111651660A (en) Method for cross-media retrieval of difficult samples
CN111091001A (en) Method, device and equipment for generating word vector of word
Chadha et al. Differentially Private Heavy Hitter Detection using Federated Analytics
CN109902169B (en) Method for improving performance of film recommendation system based on film subtitle information
Graham et al. Small sample methods
Verma et al. Variance reduction in feature hashing using MLE and control variate method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant