CN108228721A - Fast text clustering method on large corpora - Google Patents
- Publication number
- CN108228721A (application CN201711290927.3A)
- Authority
- CN
- China
- Prior art keywords
- document
- cluster
- index
- result
- value
- Prior art date: 2017-12-08
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of relational databases, and specifically provides a fast text clustering method on large corpora. Because text data is usually high-dimensional and sparse, clustering methods based purely on data similarity rarely achieve good results, whereas methods based on generative models, such as the Dirichlet multinomial mixture model, perform noticeably better. The invention exploits the symmetric prior of the Dirichlet distribution and constructs an index for optimization, so that the total running time depends only on the number of distinct words in each document; the method therefore runs efficiently even on longer documents.
Description
Technical field
The invention belongs to the technical field of relational databases, and in particular relates to a fast text clustering method on large corpora.
Background technology
Text clustering is a common problem in data mining and an important means of organizing textual information effectively; it plays an important role in research on natural language processing and related fields.
Because text data consists only of words, it is usually of higher dimensionality and sparser than data described by other extracted features, so clustering methods based purely on data similarity rarely achieve good results, whereas methods based on generative models, such as the Dirichlet multinomial mixture model, perform noticeably better.
However, the time taken by the Dirichlet multinomial mixture model is proportional to document length. In a large corpus the documents are often long, so the convergence rate is unsatisfactory and overall data-processing efficiency suffers.
Summary of the invention
The purpose of the present invention is to propose a method for fast text clustering on large corpora, so as to facilitate subsequent data processing.
The fast text clustering method on large corpora proposed by the present invention comprises the following steps:
1. Given a text data set D consisting of a large number of documents, first build an index for the subsequent cumulative-product computation, as shown in Fig. 2.
In this index, the value of the i-th element a_i is defined so that, once the index is established, the required cumulative product can be obtained with a single division, reducing the computational complexity from O(n) to O(1); a minimal sketch of such an index follows.
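The exact formula for a_i appears only as a figure (Fig. 2) in the original. A reading consistent with the text, where a ratio of two index entries replaces an O(n) running product, is a prefix-product table a_i = prod_{j=1..i}(V*beta + j - 1). The sketch below is built on that assumption, works in log space to avoid overflow (so the single division becomes a single subtraction), and uses illustrative names and parameter values rather than anything from the patent.

```python
import math

def build_index(max_n, v_beta):
    """Prefix sums of log(v_beta + j - 1) for j = 1..max_n.

    a[i] plays the role of the index entry a_i; storing logarithms avoids
    the overflow that a raw prefix product would hit almost immediately.
    """
    a = [0.0] * (max_n + 1)
    for j in range(1, max_n + 1):
        a[j] = a[j - 1] + math.log(v_beta + j - 1)
    return a

def range_log_product(a, start, length):
    """log of prod_{i=1}^{length} (start + v_beta + i - 1).

    With the index in place this is a single subtraction (a single division
    in the non-log formulation), i.e. O(1) instead of O(length).
    """
    return a[start + length] - a[start]

# Example: a denominator-style term for a cluster currently holding 1000 words
# and a document of 250 words, with an assumed V = 50_000 and beta = 0.1.
index = build_index(100_000, 50_000 * 0.1)
log_term = range_log_product(index, start=1000, length=250)
```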
2. The hyperparameters α, β and the total number of classes K for the clustering process are provided by the user. A Dirichlet multinomial mixture model based on Gibbs sampling is then used to infer the class label of each document. The detailed procedure is:
2.1. For each document in the corpus, randomly assign it a class label z_i;
2.2. Traverse all documents; for each document i, according to the current class assignments of the other documents in the corpus and the Dirichlet posterior distribution formula, sample an updated class for document i. The distribution it obeys, its derivation, the symbol definitions, and the simplified result are as shown in the formula reproduced after this step.
After the index optimization, the computational complexity of the denominator drops to O(1), while the computational complexity of the numerator is proportional to the number of distinct words in the document;
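The formulas themselves are not reproduced in this text. For reference, the standard collapsed Gibbs conditional of the Dirichlet multinomial mixture from the literature, which matches the complexity remarks above (a per-cluster denominator of N_d factors and a numerator ranging over the distinct words of the document) but is not necessarily the patent's exact expression, is:

$$
p(z_d = k \mid \vec z_{\neg d}, D) \;\propto\;
\frac{m_{k,\neg d} + \alpha}{|D| - 1 + K\alpha}\cdot
\frac{\prod_{w \in d}\prod_{j=1}^{N_d^{w}}\bigl(n_{k,\neg d}^{w} + \beta + j - 1\bigr)}
     {\prod_{i=1}^{N_d}\bigl(n_{k,\neg d} + V\beta + i - 1\bigr)}
$$

where m_k is the number of documents currently in cluster k, n_k^w the count of word w in cluster k, n_k the total word count of cluster k, N_d^w the count of w in document d, N_d the length of document d, V the vocabulary size, and the subscript ¬d excludes document d from the counts.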
2.3. For the target distribution p(x) to be sampled, choose a proposal distribution q(x) that is easier to sample from and satisfies the following property: if at step i a Markov chain is constructed with transition probability q(x_i | x_{i-1}), then after sufficiently many steps the distribution over the chain's states converges to p(x);
2.4. Sample an initial value x_0 ~ q(x);
2.5. Sample x_cand ~ q(x_cand | x_{i-1}) and compute the acceptance probability:
Accept this sampled result with the above probability, i.e. set x_i = x_cand; if it is not accepted, set x_i = x_{i-1};
During sampling, because the proposal distribution is reused across n rounds of iteration, when n is large the alias method can be used to amortize the time complexity of drawing n samples from O(K) to O(1) per draw (a sketch of the alias construction is given below);
likewise, when the cluster label of a document does not change between two rounds of iteration, x_cand = x_i and the acceptance probability need not be computed, which further speeds up the sampling process;
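A minimal sketch of the alias construction mentioned above, using the generic Walker/Vose method rather than anything specific to the patent; build_alias_table and alias_draw are illustrative names. The table is built once in O(K) from the reused proposal weights, after which every draw costs O(1).

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) preprocessing of a fixed categorical distribution."""
    k = len(weights)
    total = sum(weights)
    prob = [w * k / total for w in weights]          # rescaled so the mean is 1
    alias = [0] * k
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]                     # move excess mass of l onto s
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:                          # leftovers are exactly 1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """One O(1) draw from the preprocessed distribution."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Reusing one table across many draws is where the amortized O(1) pays off.
prob, alias = build_alias_table([0.5, 0.2, 0.2, 0.1])
labels = [alias_draw(prob, alias) for _ in range(10_000)]
```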
2.6. Repeat step 2.5 a predetermined number of times;
2.7. Return the current x_i as the sampling result;
2.8. Repeat steps 2.2-2.7 until convergence (steps 2.3-2.8 are sketched below);
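A minimal sketch of the Metropolis-Hastings loop in steps 2.3-2.8, assuming an independence proposal over the K cluster labels (which is what makes the alias table above reusable) and including the shortcut of skipping the acceptance computation when the candidate equals the current state; target_logp, proposal_probs, and the toy usage at the bottom are placeholders, not the patent's implementation.

```python
import math
import random

def mh_sample_label(target_logp, proposal_probs, x_prev, n_steps, rng=random):
    """Metropolis-Hastings over K cluster labels with an independence proposal q.

    target_logp(k)    -- unnormalised log target, e.g. the DMM conditional;
    proposal_probs[k] -- probability of label k under q (reused across rounds,
                         so in practice this draw would come from an alias table).
    """
    candidates = range(len(proposal_probs))
    x = x_prev
    for _ in range(n_steps):
        cand = rng.choices(candidates, weights=proposal_probs)[0]   # x_cand ~ q
        if cand == x:
            continue        # x_cand == x_i: no acceptance computation needed
        # independence-proposal acceptance: min(1, p(cand) q(x) / (p(x) q(cand)))
        log_ratio = (target_logp(cand) - target_logp(x)
                     + math.log(proposal_probs[x]) - math.log(proposal_probs[cand]))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = cand                                                # accept
        # else keep x, i.e. x_i = x_{i-1}
    return x

# Illustrative usage with a toy 4-cluster target and a uniform proposal.
toy_logp = lambda k: math.log((0.1, 0.6, 0.2, 0.1)[k])
label = mh_sample_label(toy_logp, proposal_probs=[0.25] * 4, x_prev=0, n_steps=30)
```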
As an optimization, during the sampling process, for computations involving products of factors of the form f(n_kw), if only a small fraction of the f(n_kw) values change between two rounds of iteration, the computation can be carried out only for the w whose n_kw changed, reducing the time complexity of the sampling process (a sketch follows). As the threshold for "a small fraction", 20% is generally used and can be adjusted to the actual situation; in theory the optimization still helps below 50%, so the threshold may range from 20% to 50%;
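A sketch of this incremental recomputation, under the assumption that the quantity of interest is a per-cluster product of f(n_kw) over words w (the exact expression is not reproduced in this text) and that the cached value is kept in log space; f, the counts dictionaries, and the threshold comment are illustrative.

```python
import math

def f(n_kw, beta=0.1):
    # illustrative factor; the patent's exact f(n_kw) is not reproduced in the text
    return n_kw + beta

def update_log_product(cached_log_prod, old_counts, new_counts, changed_words):
    """Correct a cached sum of log f(n_kw) only for the words w whose count n_kw
    changed between two rounds of iteration (counts of 0 are kept as explicit
    entries so the old and new products range over the same words)."""
    for w in changed_words:
        cached_log_prod += math.log(f(new_counts[w])) - math.log(f(old_counts[w]))
    return cached_log_prod

# If more than roughly 20%-50% of the factors changed, recomputing from scratch
# is cheaper; below that, the incremental correction above is the faster path.
```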
2.9. Output the clustering result according to the class label assigned to each document.
By exploiting the symmetric prior of the Dirichlet distribution and constructing an index for optimization, the present invention makes the total running time depend only on the number of distinct words in each document, so the method runs efficiently even on longer documents.
Description of the drawings
Fig. 1 is the graphical model of the Dirichlet multinomial mixture used in this method.
Fig. 2 shows the process of building the index sequentially.
Specific embodiment
For convenience of description, the index-optimized fast text clustering method is hereinafter referred to as IGSDMM. The advantage of the present invention over existing clustering algorithms is illustrated on two datasets, described as follows:
NG20. This dataset contains 18,846 documents from 20 mainstream Western newsgroups and is a classic benchmark for evaluating text clustering algorithms. The average document length in NG20 is 137.85 words, and the average number of distinct words per document is 91.
Tweet. This dataset consists of 2,472 tweets related to 89 queries; the relevance between tweets and queries was manually annotated. The average tweet length is 8.56 words, and the average number of distinct words per tweet is 7.
Normalized mutual information (NMI) is widely used to measure the quality of clustering results. NMI measures the statistical information shared between the random variable representing the cluster assignment and the true class labels of the documents. In its formal definition, n_c is the number of documents in class c, n_k is the number of documents in cluster k, n_{c,k} is the number of documents that are both in class c and in cluster k, and N is the number of documents in the dataset. When the clustering result perfectly matches the true labels, NMI equals 1; when the clustering result is random, NMI is close to 0.
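The NMI formula itself is not reproduced in this text; the standard definition consistent with the symbols above is:

$$
\mathrm{NMI} \;=\; \frac{\sum_{c,k} n_{c,k}\,\log\dfrac{N\, n_{c,k}}{n_c\, n_k}}
{\sqrt{\Bigl(\sum_{c} n_c \log\dfrac{n_c}{N}\Bigr)\Bigl(\sum_{k} n_k \log\dfrac{n_k}{N}\Bigr)}}
$$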
We compare the performance of IGSDMM with K-means and LDA. The operating parameters of IGSDMM are set as follows: α and β are both set to 0.1, and the number of iterations is set to 30. The results are shown in the following table:
In the comparison, K is also set to 0.5, 1, 2, and 3 times the true value of K. To ensure that the randomness of the algorithms does not affect the comparison, each reported value is the average of 10 runs.
The table shows that, on both NG20 and Tweet, IGSDMM outperforms the other two classic clustering algorithms. It can also be observed that, for this algorithm, performance still improves when K is set larger, whereas for K-means a relatively large K setting leads to a decline in performance.
For this algorithm, setting a larger K value not only does not degrade performance but can lead to better performance, because the algorithm automatically infers a suitable number of clusters. It is therefore worthwhile to set a larger number of clusters.
Claims (2)
1. A fast text clustering method on a large corpus, characterized in that the steps are as follows:
(1) Given a text data set D consisting of a large number of documents, first build an index for the subsequent cumulative-product computation;
in this index, the value of the i-th element a_i is defined such that, once the index is established, the required cumulative product can be obtained with a single division;
(2) The hyperparameters α, β and the total number of classes K for the clustering process are provided by the user; a Dirichlet multinomial mixture model based on Gibbs sampling is used to infer the class label of each document; the detailed procedure is:
(2.1) For each document in the corpus, randomly assign it a class label z_i;
(2.2) Traverse all documents; for each document i, according to the current class assignments of the other documents in the corpus and the Dirichlet posterior distribution formula, sample an updated class for document i; the distribution it obeys, and the result of simplifying the distribution formula, are as follows:
(2.3) For the target distribution p(x) to be sampled, choose a proposal distribution q(x) that is easier to sample from and satisfies the following property: if at step i a Markov chain is constructed with transition probability q(x_i|x_{i-1}), then after sufficiently many steps the distribution over the chain's states converges to p(x);
(2.4) Sample an initial value x_0 ~ q(x);
(2.5) Sample x_cand ~ q(x_cand|x_{i-1}) and compute the acceptance probability; accept this sample with that probability, i.e. set x_i = x_cand, and otherwise set x_i = x_{i-1};
(2.6) Repeat step (2.5) a predetermined number of times;
(2.7) Return the current x_i as the sampling result;
(2.8) Repeat steps (2.2)-(2.7) until convergence;
(2.9) Output the clustering result according to the class label assigned to each document;
The symbols used in the formulas and their meanings are as follows:
2. The method according to claim 1, characterized in that, during sampling, for computations involving factors of the form f(n_kw), if only a small fraction of the f(n_kw) values change between two rounds of iteration, the computation is carried out only for the w whose n_kw changed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711290927.3A CN108228721B (en) | 2017-12-08 | 2017-12-08 | Fast text clustering method on large corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108228721A true CN108228721A (en) | 2018-06-29 |
CN108228721B CN108228721B (en) | 2021-06-04 |
Family
ID=62653406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711290927.3A Active CN108228721B (en) | 2017-12-08 | 2017-12-08 | Fast text clustering method on large corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108228721B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101582080A (en) * | 2009-06-22 | 2009-11-18 | 浙江大学 | Web image clustering method based on image and text relevant mining |
CN102831119A (en) * | 2011-06-15 | 2012-12-19 | 日电(中国)有限公司 | Short text clustering equipment and short text clustering method |
US20150039617A1 (en) * | 2013-08-01 | 2015-02-05 | International Business Machines Corporation | Estimating data topics of computers using external text content and usage information of the users |
CN103714171A (en) * | 2013-12-31 | 2014-04-09 | 深圳先进技术研究院 | Document clustering method |
CN103870840A (en) * | 2014-03-11 | 2014-06-18 | 西安电子科技大学 | Improved latent Dirichlet allocation-based natural image classification method |
Non-Patent Citations (1)
Title |
---|
KANG Tiegang et al., "A Word Clustering Method Based on a Large-Scale Annotated Corpus", Journal of System Simulation * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109829164A (en) * | 2019-02-01 | 2019-05-31 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating text |
Also Published As
Publication number | Publication date |
---|---|
CN108228721B (en) | 2021-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cai et al. | Deeplearning model used in text classification | |
Carpenter | LingPipe for 99.99% recall of gene mentions | |
CN102043851A (en) | Multiple-document automatic abstracting method based on frequent itemset | |
CN109508374B (en) | Text data semi-supervised clustering method based on genetic algorithm | |
CN107633000B (en) | Text classification method based on tfidf algorithm and related word weight correction | |
CN107066555A (en) | Towards the online topic detection method of professional domain | |
CN109993216B (en) | Text classification method and device based on K nearest neighbor KNN | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN102243641A (en) | Method for efficiently clustering massive data | |
CN112347246B (en) | Self-adaptive document clustering method and system based on spectrum decomposition | |
CN107992549B (en) | Dynamic short text stream clustering retrieval method | |
Matusevych et al. | Hokusai-sketching streams in real time | |
CN111325033B (en) | Entity identification method, entity identification device, electronic equipment and computer readable storage medium | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN105760875A (en) | Binary image feature similarity discrimination method based on random forest algorithm | |
Li et al. | Quantization algorithms for random fourier features | |
CN108228721A (en) | Fast text clustering method on large corpora | |
CN112182337B (en) | Method for identifying similar news from massive short news and related equipment | |
Sun et al. | Chinese microblog sentiment classification based on convolution neural network with content extension method | |
CN111651660A (en) | Method for cross-media retrieval of difficult samples | |
CN111091001A (en) | Method, device and equipment for generating word vector of word | |
Chadha et al. | Differentially Private Heavy Hitter Detection using Federated Analytics | |
CN109902169B (en) | Method for improving performance of film recommendation system based on film subtitle information | |
Graham et al. | Small sample methods | |
Verma et al. | Variance reduction in feature hashing using MLE and control variate method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||