A text clustering method based on class-center compression transformation in a search engine
Technical field
The invention belongs to the technical field of text mining and machine learning research, and in particular relates to a text clustering method for search engines based on class-center compression transformation. By combining many factors, such as synonym phrases, co-occurring associated phrases, vocabulary centers, class centers, title content, and document length, the method repeatedly clusters and splits the text set in an iterative manner to improve clustering precision. The method is applicable to search engines and information retrieval systems.
Background technology
In the real world, text is the most important carrier of information; research shows that about 80% of information is contained in text documents. On the Internet in particular, text data is widely present in many forms, such as news reports, e-books, research papers, digital libraries, web pages, and e-mail. Text clustering technology can be applied to information filtering and personalized information recommendation, enabling people to retrieve the required information accurately and shortening retrieval time. At the same time, text clustering is a method that can partition texts into classes without a training set, so it can effectively solve the problem of automatic text partitioning. Because text clustering does not require texts to be manually labeled with classes in advance, it has a certain flexibility and a high capacity for automatic processing, and it has become an important means of organizing, summarizing, and navigating text information.
Most existing text clustering methods calculate the similarity between texts based on the VSM (vector space model), which assumes that words are mutually independent when constructing text vectors. This approach ignores the associations between words within the same document and the potential connections between words across different documents. Traditional clustering models are also constrained by several conditions, such as the input order of documents, the number of initial classes, and the selection of initial center points. Positional clustering of words and the mining of synonyms are likewise ignored by conventional text clustering methods. The calculation of document similarity is therefore affected, and the clustering result is not accurate enough. The method proposed in this patent extracts keywords as features of the data set, removes meaningless words, filters out words with little influence, and mines potential semantic relations such as document topics, synonym phrases, and high-frequency co-occurring phrases to improve clustering precision. It compresses the center vocabulary, uses an improved tf-idf method to calculate similarity weights between words, and eliminates the influence of document input order through iterative clustering and the splitting off of new classes. The goal is to make the similarity between texts of the same class as large as possible and the similarity between texts of different classes as small as possible.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the limitations of the prior art and provide a text clustering method based on class-center compression transformation. The method mines potential semantic relations such as document topics, synonym phrases, and high-frequency co-occurring phrases, and adopts transformations such as class-center compression, re-clustering around centers, and splitting off new classes to improve text clustering precision.
The technical scheme by which the present invention solves the above technical problem is a text clustering method based on class-center compression transformation in a search engine, comprising the following steps:
Step 1: perform word segmentation on each text in the text set to be clustered;
Step 2: remove stop words and filter out words with little influence;
Step 3: calculate the number of occurrences tf of each word in each text;
Step 4: calculate the inverse document frequency of each word, idf = log(fileNum / freOccur), where fileNum is the total number of texts and freOccur is the number of texts in which the word occurs;
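The inverse document frequency of step 4 can be sketched as follows; the function name and the list-of-token-lists representation of the text set are illustrative assumptions, not part of the invention:

```python
import math

def inverse_document_frequency(word, texts):
    """idf = log(fileNum / freOccur), where fileNum is the total number of
    texts and freOccur is the number of texts in which the word occurs."""
    file_num = len(texts)
    fre_occur = sum(1 for tokens in texts if word in tokens)
    # Guard against words that occur in no text (assumed behavior).
    return math.log(file_num / fre_occur) if fre_occur else 0.0

texts = [["search", "engine"], ["search", "cluster"], ["cluster", "center"]]
print(inverse_document_frequency("search", texts))  # log(3/2)
```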
Step 5: mine synonym phrases;
Step 6: mine high-frequency co-occurring phrases, i.e. phrase pairs that appear together in several different texts;
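Step 6 can be sketched as follows; the `min_texts` threshold and the pair representation are assumed parameters, since the text does not fix them:

```python
from itertools import combinations
from collections import Counter

def cooccurring_pairs(texts, min_texts=2):
    """Return word pairs that appear together in at least `min_texts`
    different texts (the high-frequency co-occurring phrases of step 6)."""
    counts = Counter()
    for tokens in texts:
        # Count each pair at most once per text.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_texts}

texts = [["data", "mining", "text"], ["data", "mining", "web"], ["web", "page"]]
print(cooccurring_pairs(texts))  # {('data', 'mining')}
```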
Step 7: generate the original class centers from the synonym phrases and high-frequency co-occurring phrases; each class center consists of a series of high-frequency words; record tf and idf of each high-frequency word and mark the class center to which it belongs;
Step 8: calculate the content length of each text, extract the title of the article, and segment the title into words; if there is no title, set the title to empty; extract paragraph-initial and paragraph-final words and mark them for the weighted calculation below;
Step 9: calculate the similarity between every two texts; when the title or content contains identical words or synonyms, increase the weight; paragraph-initial and paragraph-final words are given different weights respectively. The calculation formula is as follows:
pureFileSim(i,j) = (contentSimilarity(i,j) + titleSimilarity(i,j)) / log(fileLength_i * fileLength_j);
In the formula:
pureFileSim(i,j): the pure similarity of text i and text j;
contentSimilarity(i,j): the content similarity of text i and text j;
titleSimilarity(i,j): the title similarity of text i and text j;
fileKeywordTf(x,i): the tf of keyword x in text i;
fileKeywordIdf(x,i): the idf of keyword x in text i;
fileTitleWordTf(j,y): the tf of title word y of text j;
fileTitleWordIdf(j,y): the idf of title word y of text j;
fileLength_i: the content length of text i;
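The pure-similarity formula of step 9 can be sketched as follows. The inner `weighted_overlap` is a stand-in assumption for the contentSimilarity and titleSimilarity computations, whose exact form the text does not specify:

```python
import math

def weighted_overlap(weights_a, weights_b):
    """Assumed similarity kernel: sum of products of tf-idf weights of the
    words shared by both texts (stand-in for content/title similarity)."""
    return sum(weights_a[w] * weights_b[w] for w in weights_a.keys() & weights_b.keys())

def pure_file_sim(content_i, content_j, title_i, title_j, len_i, len_j):
    """pureFileSim(i,j) = (contentSimilarity(i,j) + titleSimilarity(i,j))
                          / log(fileLength_i * fileLength_j)"""
    numerator = weighted_overlap(content_i, content_j) + weighted_overlap(title_i, title_j)
    return numerator / math.log(len_i * len_j)

ci = {"cluster": 0.5, "text": 0.3}   # word -> tf-idf weight of text i's content
cj = {"cluster": 0.4, "web": 0.2}    # word -> tf-idf weight of text j's content
print(pure_file_sim(ci, cj, {"cluster": 1.0}, {"cluster": 1.0}, 100, 200))
```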
Step 10: randomize the input order of the texts, and perform initial clustering of the text set according to the original class centers. The algorithm is as follows: for each text, calculate its similarity to all class centers and select the class-center id with the largest similarity as the class to which the text belongs. The formula for the similarity between text i and class center j is as follows:
In the formula:
fileKeywordTf(x,i): the tf of keyword x in text i;
fileKeywordIdf(x,i): the idf of keyword x in text i;
centerKeywordTf(j,y): the tf of keyword y in class center j;
centerKeywordIdf(j,y): the idf of keyword y in class center j;
fileContentLength_i: the content length of text i;
At the same time, calculate the class center closest to each word and record the word's wordid;
Calculate the percentage by which the most similar class center exceeds the second most similar class center, and record it as the text's diffRatio;
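The assignment and diffRatio of step 10 can be sketched as follows; the function name and the list representation of a text's per-center similarities are illustrative:

```python
def assign_with_diff_ratio(center_sims):
    """Given one text's similarities to every class center, return the index
    of the best center and the diffRatio: the percentage by which the most
    similar center exceeds the second most similar one."""
    ranked = sorted(range(len(center_sims)), key=lambda k: center_sims[k], reverse=True)
    best, second = ranked[0], ranked[1]
    diff_ratio = (center_sims[best] - center_sims[second]) / center_sims[second] * 100
    return best, diff_ratio

best, ratio = assign_with_diff_ratio([0.8, 0.5, 0.1])
print(best, ratio)  # best center 0, roughly 60% ahead of the runner-up
```

Texts whose diffRatio falls below 10% sit ambiguously between two centers and are set aside in step 11.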
Step 11: set aside the texts whose diffRatio is less than 10%; for the remaining texts, perform keyword extraction and statistics on the text set belonging to each class, and use these words to regenerate the class center. The selected words must have tf and idf not below certain thresholds. Update the center id of each word and compress the class centers so that the same word appears only in the few class centers most similar to it; merge class centers with high mutual similarity;
Step 12: recalculate the class center to which each text belongs according to the new class centers; the similarity calculation is the same as in step 9;
Step 13: calculate the core similarity of each class and attempt to split the largest class to produce a new class. The splitting method is as follows: find the most active text fx in the class, i.e. the text that appears most often as the most similar text of the other texts, with the larger similarity values; find the text fy in the class with the smallest similarity to fx; establish a new class center ctx from fx and the texts most similar to fx, and a new class center cty from fy and the texts most similar to fy; for each remaining text in the class, calculate its similarity to ctx and cty and merge it into one of the two;
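The splitting procedure of step 13 can be sketched as follows, assuming the class's texts and a pairwise similarity function `sim` are given; all names are illustrative, and the seed sets are reduced to the single texts fx and fy for brevity:

```python
def split_class(texts, sim):
    """Split one class in two: fx is the most 'active' text (most often the
    nearest neighbor of the others), fy is the text least similar to fx;
    each remaining text joins whichever seed it is more similar to."""
    # Count how often each text is some other text's nearest neighbor.
    activity = {t: 0 for t in texts}
    for t in texts:
        nearest = max((u for u in texts if u != t), key=lambda u: sim(t, u))
        activity[nearest] += 1
    fx = max(texts, key=lambda t: activity[t])
    fy = min((t for t in texts if t != fx), key=lambda t: sim(fx, t))
    ctx, cty = [fx], [fy]  # the two new class seeds
    for t in texts:
        if t in (fx, fy):
            continue
        (ctx if sim(t, fx) >= sim(t, fy) else cty).append(t)
    return ctx, cty

sim = lambda a, b: -abs(a - b)  # toy similarity on numbers
print(split_class([0, 1, 10, 11], sim))  # ([0, 1], [11, 10])
```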
Step 14: for the texts set aside in step 11 because of their low similarity to any class center, merge each text into the class whose id matches the center id of most of its words;
Step 15: repeat steps 10-14 until the number of classes converges and the similarity between the texts in each class and their class center reaches a certain threshold, then stop.
Principle of the present invention is:
The text clustering method based on class-center compression transformation in a search engine improves the traditional tf-idf (term frequency - inverse document frequency) formula and calculates the word weights of each document in the text set. Initial class centers are generated from a large data set; synonym phrases and high-frequency co-occurring phrases are mined; vocabulary centers and the pure similarity and association similarity between documents are calculated; and an initial classification is made according to the similarity between each document and the initial class centers. Based on information such as title words, article length, synonyms, co-occurring associated words, vocabulary centers, and the percentage by which the most similar class center exceeds the second most similar one, the center vocabulary is compressed so that the same word appears only in the few class centers most similar to it, and the document set is clustered again with the new class centers. The core similarity of each class is calculated, and the largest class is split to produce a new class. The compression, clustering, and splitting operations are iterated until the number of classes converges and the similarity between the texts in each class and their class center reaches a certain threshold.
Compared with the prior art, the advantages of the present invention are:
In the research field of text clustering, KMeans (the K-means method: given the number k of partitions to build, the partitioning method first creates an initial partition and then adopts an iterative relocation technique, attempting to improve the partition by moving objects between partitions) and DBSCAN (which regards a cluster as a high-density region of objects in the data space separated by low-density regions; for each text object in a cluster, the number of text objects contained within its neighborhood of a given radius ε must not be less than a given minimum number MinPts, and clustering continues as long as the density, i.e. the number of objects, of a nearby region exceeds a certain threshold) are the commonly used methods. But both have shortcomings: the KMeans method is strongly affected by the initial cluster centers, requires the cluster number k to be specified in advance, and is easily affected by isolated points and by the file input order; the DBSCAN method, although it can find clusters of arbitrary shape, is sensitive to the parameters ε and MinPts and easily divides texts of the same type into several different classes. The clustering effect produced by the traditional methods is not ideal.
The present invention overcomes the shortcomings of the classic methods: it does not need the number of clusters to be specified in advance, the input order of the documents does not affect the clustering result, and it is insensitive to parameters such as the similarity radius.
Brief description of the drawings
Fig. 1 is a block diagram of the text clustering method based on class-center compression transformation;
Fig. 2 is a structural diagram of the text clustering method based on class-center compression transformation;
Fig. 3 shows the variation of the clustering precision of the KMeans method and the class-center compression transformation method as the initial K value rises;
Fig. 4 shows the variation of the number of clusters of the DBSCAN method and the class-center compression transformation method as the class-center similarity radius increases;
Fig. 5 is a comparison of the average clustering precision of the KMeans method, the DBSCAN method, and the class-center compression transformation method.
Embodiment
The present invention is further described below in conjunction with the drawings and embodiments.
The text clustering method based on class-center compression transformation of the present invention fully mines the potential semantic associations between the words of texts, calculates vocabulary centers, and compresses class centers to improve the precision of text clustering. It calculates the similarity between class centers and texts, and iteratively splits, merges, and recombines class centers until a certain criterion is met. In mining the potential semantic associations between text words, an improved tf-idf is used to calculate the similarity between texts, serving as an important indicator of the degree of association between words. At the same time, the title of each document is extracted and segmented into words, and the similarity of title words is weighted.
The improved formulas are:
tf_new = log(tf) + 1
idf = log(fileNum / freOccur)
where fileNum is the total number of texts and freOccur is the number of texts in which the word occurs;
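The improved term-frequency weighting can be sketched as follows; the handling of tf = 0 is an added assumption, since the formula is only stated for words that occur:

```python
import math

def improved_tf(tf):
    """Dampened term frequency, tf_new = log(tf) + 1, so that a word
    occurring many times does not dominate the weight linearly."""
    return math.log(tf) + 1 if tf > 0 else 0.0

print(improved_tf(1))    # 1.0 (a single occurrence keeps unit weight)
print(improved_tf(100))  # log(100) + 1, about 5.6
```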
In mining the potential semantic associations between text words, synonym phrases and high-frequency co-occurring phrases (phrases appearing together in several documents) are mined to improve the computational accuracy of word similarity, and these serve as the initial word clustering centers.
In calculating vocabulary centers, features such as the number of times words appear in the same document, the frequency with which they appear in different documents, their near-synonyms, and their co-occurring associated words are used to assign keyword labels to words and to calculate the vocabulary center closest to each word.
In compressing class centers, the center id of each word is updated and the class centers are compressed so that the same word appears only in the few class centers most similar to it.
In calculating the similarity between class centers and texts, the input order of the texts is randomized. The similarity is calculated from the class centers and the preprocessed text information. For each text, its similarity to all class centers is calculated, and the class-center id with the largest similarity is selected as the class to which the text belongs.
In splitting class centers, the core similarity of each class is calculated, and an attempt is made to split the largest class to produce a new class. The splitting method is as follows: find the most active text fx in the class, i.e. the text that appears most often as the most similar text of the other texts, with the larger similarity values; find the text fy in the class with the smallest similarity to fx; establish a new class center ctx from fx and the texts most similar to fx, and a new class center cty from fy and the texts most similar to fy; for each remaining text in the class, calculate its similarity to ctx and cty and merge it into one of the two.
In merging class centers, the similarity between class centers is calculated, classes whose similarity reaches a certain standard are merged, and the words of these classes are used to regenerate the class center. The selected words must have tf and idf not below certain thresholds.
The iterative operation repeatedly splits and recombines class centers until the number of classes converges and the similarity between the texts in each class and their class center reaches a certain threshold.
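The overall iteration can be sketched as a loop skeleton; the four callables are hypothetical stand-ins for the clustering, compression, splitting, and convergence operations the method describes, and the trivial demonstrations below exist only to make the skeleton runnable:

```python
def iterate_clustering(texts, centers, cluster, compress, split, converged):
    """Skeleton of the iterate-until-convergence loop: re-cluster, compress
    the centers, attempt a split, and stop once the number of classes is
    stable and within-class similarity passes the threshold."""
    while True:
        assignment = cluster(texts, centers)
        centers = compress(centers, assignment)
        centers = split(centers, assignment)
        if converged(centers, assignment):
            return centers, assignment

# Minimal demonstration with trivial stand-ins: converge after two passes.
state = {"passes": 0}
def cluster(texts, centers): return {t: 0 for t in texts}
def compress(centers, assignment): return centers
def split(centers, assignment): return centers
def converged(centers, assignment):
    state["passes"] += 1
    return state["passes"] >= 2

print(iterate_clustering(["a", "b"], [0], cluster, compress, split, converged))
```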
The text clustering method based on class-center compression transformation in a search engine of the present invention is mainly divided into the following 15 steps.
1. Perform word segmentation on each text in the text set to be clustered.
2. Remove stop words and filter out words with little influence.
3. Calculate the number of occurrences tf of each word in each text.
4. Calculate the inverse document frequency idf of each word.
5. Mine synonym phrases.
6. Mine high-frequency co-occurring phrases, i.e. phrase pairs that appear together in several different texts.
7. Generate the original class centers from the synonym phrases and high-frequency co-occurring phrases; each class center consists of a series of high-frequency words; record tf and idf of each high-frequency word and mark the class center to which it belongs.
8. Calculate the content length of each text, extract the title of the article (if there is no title, set the title to empty), and segment the title into words; extract paragraph-initial and paragraph-final words and mark them for the weighted calculation below.
9. Calculate the similarity between every two texts (identical words or synonyms in the title or content increase the weight); paragraph-initial and paragraph-final words are given different weights respectively. The calculation formula is:
pureFileSim(i,j) = (contentSimilarity(i,j) + titleSimilarity(i,j)) / log(fileLength_i * fileLength_j);
In the formula:
pureFileSim(i,j): the pure similarity of text i and text j;
contentSimilarity(i,j): the content similarity of text i and text j;
titleSimilarity(i,j): the title similarity of text i and text j;
fileKeywordTf(x,i): the tf of keyword x in text i;
fileKeywordIdf(x,i): the idf of keyword x in text i;
fileTitleWordTf(j,y): the tf of title word y of text j;
fileTitleWordIdf(j,y): the idf of title word y of text j;
fileLength_i: the content length of text i.
10. Randomize the input order of the texts and perform initial clustering of the text set according to the original class centers. The algorithm is as follows: for each text, calculate its similarity to all class centers and select the class-center id with the largest similarity as the class to which the text belongs. The formula for the similarity between text i and class center j is as follows:
In the formula:
fileKeywordTf(x,i): the tf of keyword x in text i;
fileKeywordIdf(x,i): the idf of keyword x in text i;
centerKeywordTf(j,y): the tf of keyword y in class center j;
centerKeywordIdf(j,y): the idf of keyword y in class center j;
fileContentLength_i: the content length of text i.
At the same time, calculate the class center closest to each word and record the word's wordid.
Calculate the percentage by which the most similar class center exceeds the second most similar class center, and record it as the text's diffRatio.
11. Set aside the texts whose diffRatio is less than 10%; for the remaining texts, perform keyword extraction and statistics on the text set belonging to each class, and use these words to regenerate the class center. The selected words must have tf and idf not below certain thresholds. Update the center id of each word and compress the class centers so that the same word appears only in the few class centers most similar to it. Merge class centers with high mutual similarity.
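Step 11's center regeneration and compression can be sketched as follows; the dictionary representations (word to (tf, idf), and word to per-center similarity) and the threshold values are illustrative assumptions:

```python
def regenerate_center(word_stats, tf_min, idf_min):
    """Rebuild a class center from its texts' keywords, keeping only words
    whose tf and idf are not below the given thresholds."""
    return {w for w, (tf, idf) in word_stats.items()
            if tf >= tf_min and idf >= idf_min}

def compress_centers(word_center_sim):
    """Compress: keep each word only in the single class center it is most
    similar to. Input maps word -> {center_id: similarity}."""
    return {w: max(sims, key=sims.get) for w, sims in word_center_sim.items()}

stats = {"search": (5, 2.0), "the": (50, 0.1), "cluster": (3, 1.5)}
print(sorted(regenerate_center(stats, tf_min=3, idf_min=1.0)))  # ['cluster', 'search']
print(compress_centers({"search": {0: 0.9, 1: 0.2}}))           # {'search': 0}
```

The stop-word-like "the" is dropped by its low idf even though its tf is high, which is the point of thresholding both statistics.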
12. Recalculate the class center to which each text belongs according to the new class centers; the similarity calculation is the same as in step 9.
13. Calculate the core similarity of each class and attempt to split the largest class to produce a new class. The splitting method is as follows: find the most active text fx in the class, i.e. the text that appears most often as the most similar text of the other texts, with the larger similarity values; find the text fy in the class with the smallest similarity to fx; establish a new class center ctx from fx and the texts most similar to fx, and a new class center cty from fy and the texts most similar to fy; for each remaining text in the class, calculate its similarity to ctx and cty and merge it into one of the two.
14. For the texts set aside in step 11 because of their low similarity to any class center, merge each text into the class whose id matches the center id of most of its words.
15. Repeat steps 10-14 until the number of classes converges and the similarity between the texts in each class and their class center reaches a certain threshold, then terminate.