CN101694670A - Chinese Web document online clustering method based on common substrings

Info

Publication number: CN101694670A
Application number: CN200910236138A
Authority: CN
Inventors: 张辉; 王德庆; 王晗; 杨高
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2009-10-20
Filing date: 2009-10-20
Publication date: 2010-04-14
Anticipated expiration: 2029-10-20
Also published as: CN101694670B

Abstract

The invention discloses an online clustering method for Chinese Web documents based on common substrings. As the amount of information on the Internet grows rapidly, search engines play an important role in finding and locating information. Web document clustering automatically groups the results returned by a search engine by topic, helping users narrow the query range and quickly locate the information they need. Online clustering of Web documents has two characteristics: the objects being clustered are non-numerical and unstructured, and the clustering time must meet the requirements of online search. To address both, the method comprises the following steps: (1) preprocess the first n results returned by the search engine, deleting or replacing the non-Chinese characters in them; (2) extract the common substrings of the Web documents using the generalized suffix array (GSA); (3) build a document feature vector model from the extracted common substrings, using a weighting formula modeled on TF*IDF; (4) compute the pairwise similarity of the Web documents from the model to obtain a similarity matrix; (5) cluster the Web documents with an improved hierarchical clustering algorithm based on the matrix; and (6) generate cluster descriptions and extract cluster labels. The method shows clear advantages in clustering performance, cluster label generation, and clustering time.

Description

A Chinese Web document online clustering method based on common substrings

Technical field

The invention belongs to the technical field of information processing. It is a data mining method and relates specifically to an online clustering method for Web documents.

Background technology

Clustering is essentially a mapping process. Given an object set O = {o_1, o_2, ..., o_n} and a class set π = {c_1, c_2, ..., c_m}, clustering is a mapping from the objects in O to the classes in π that satisfies:

(1) c_{i} \subseteq O \quad (i = 1, 2, \ldots, t)

(2) \bigcup_{i = 1}^{t} c_{i} = O

With the growing popularity of the Internet and the rapid increase of online information, a traditional search engine tends to return a large number of results, making it hard for users to find the information they actually need. Web document clustering addresses this problem well: it groups the results returned by the search engine by content, so that users can narrow the range of candidates and quickly find the information of interest.

Web document clustering is an unsupervised form of document classification. It divides a document set into several clusters (subclasses) such that the similarity of document content within a cluster is as large as possible and the similarity between different clusters is as small as possible. Compared with general clustering, online clustering of Web documents has two characteristics. First, the objects being clustered are Web documents, which are non-numerical and unstructured. Second, the clustering time must meet the requirements of online retrieval, so the algorithm must be real-time and interactive.

Research on Web document clustering mainly follows three approaches: link-based clustering, clustering based on text similarity, and clustering based on user feedback. At present, the most common methods for clustering search engine results are based on document similarity. The idea is to represent each document abstractly as a vector, measure the similarity between two documents by the cosine of the angle between their vectors, and then cluster the documents with some clustering algorithm (such as K-means or STC).

These methods suit English information retrieval systems. Chinese text, however, has no spaces between words, so such methods must rely on a word segmentation system and perform poorly for Chinese information retrieval. The present invention proposes an online clustering algorithm for Chinese Web documents that requires no Chinese word segmentation.

Summary of the invention

The technical problems to be solved by the present invention:

1. Current general-purpose Web document clustering methods suit English information retrieval systems. Chinese text has no spaces between words, so these methods must rely on a word segmentation system, and the quality of the dictionary fundamentally affects the clustering result. The present invention uses a segmentation-free technique, which avoids the influence of the dictionary and improves clustering performance.

2. The execution time of online Web document clustering must meet the requirements of online retrieval, so the algorithm must be strongly real-time and interactive.

The technical solution adopted by the present invention:

The processing flow consists of the following steps: 1) Web document preprocessing, which deletes or replaces the non-Chinese characters in the results returned by the search engine; 2) extraction of the common substrings of the Web documents using GSA, with the common substrings then used as document features; 3) computation of the pairwise similarity of the documents to be clustered, forming the document similarity matrix; 4) clustering of the documents with a clustering algorithm applied to the similarity matrix; 5) cluster description and label extraction, that is, assigning each cluster a label that both summarizes the content of the cluster and distinguishes it from the other clusters.

The beneficial effects of the present invention:

The online clustering method has clear advantages in performance, cluster label generation, and clustering time:

1. Unlike traditional text clustering systems, the proposed method needs no word segmentation. Instead, it uses the GSA algorithm to extract the common substrings of the Web documents, takes them as document features, and uses the resulting feature vectors in the clustering computation. This addresses the problem that Web texts, as clustering objects, are non-numerical and unstructured.

2. To find the common substrings of character strings, the invention uses the GSA algorithm, a variant of the suffix tree (Suffix Tree) algorithm. Its time complexity is O(n) and its space complexity is S(n), which is better than that of the suffix tree algorithm.

3. Traditional hierarchical clustering methods (whether agglomerative or divisive) have high complexity and poor scalability, and are therefore unsuitable for clustering large numbers of documents. The invention therefore optimizes traditional agglomerative hierarchical clustering and obtains good clustering results.

4. The invention uses the common substring with the largest weight as the cluster label, which preserves semantic content and makes the cluster labels highly readable.

The following experiments verify the effect of the invention:

The main evaluation indicators of the clustering algorithm are the CH value, cluster label validity, and the clustering result.

The CH function is defined as follows:

CH = \frac{traceB / (k - 1)}{traceW / (n - k)}

traceB = \sum_{j = 1}^{k} n_{j} \| u_{j} - u \|^{2}

traceW = \sum_{j = 1}^{k} \sum_{i = 1}^{n_{j}} \| x_{i} - u_{j} \|^{2}

where n_j is the number of texts in the j-th cluster; u_j is the centroid of the j-th cluster; u is the centroid of all texts being clustered; x_i is the i-th text in the corresponding cluster; k is the total number of clusters; and n is the total number of texts. The CH function combines the within-cluster distance and the between-cluster distance of a clustering result: the larger the CH value, the better the clustering.
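As an illustration of the CH function, the following minimal Python sketch computes it for clusters of numeric vectors. It assumes squared Euclidean distances in both traces, as in the usual definition of the Calinski-Harabasz index; the function name `ch_index` and the input layout are illustrative choices, not part of the invention.

```python
def ch_index(clusters):
    """CH value for `clusters`, a list of clusters, each a list of
    equal-length numeric vectors (illustrative input layout)."""
    points = [x for c in clusters for x in c]
    n, k = len(points), len(clusters)
    dim = len(points[0])

    def centroid(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    def sqdist(a, b):
        # Squared Euclidean distance (assumed for both traces)
        return sum((p - q) ** 2 for p, q in zip(a, b))

    u = centroid(points)              # centroid of all texts
    trace_b = 0.0
    trace_w = 0.0
    for c in clusters:
        u_j = centroid(c)             # centroid of cluster j
        trace_b += len(c) * sqdist(u_j, u)
        trace_w += sum(sqdist(x, u_j) for x in c)
    return (trace_b / (k - 1)) / (trace_w / (n - k))


example = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
print(ch_index(example))
```

Two tight, well-separated clusters such as those in `example` give a large value, in line with the statement that a larger CH value indicates better clustering.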

Five keywords were used for retrieval in the experiment. The following table compares the CH values of the proposed model with those of the Chinese word segmentation + tf*idf model:

Keyword	Word segmentation + tf*idf	Model based on common substrings
Apple	16.916	17.007
Yao Ming	14.785	20.516
The Department of Science and Technology	13.146	16.597
Object-oriented	17.860	17.764
Data mining	11.593	16.974

The experiment shows that the model based on common substrings yields a higher CH value than the word segmentation + tf*idf model for four of the five keywords; for "Yao Ming" and "data mining" the CH value rises by about 5.7 and 5.4 respectively. The new method therefore gives better clustering results than the traditional method.

The validity of a cluster label, that is, its readability, is very important to the user: only phrases with real meaning can serve as cluster labels. Label validity is computed as P = M/N, where M is the number of readable labels and N is the total number of labels. The experimental results are shown in Fig. 1: the label validity of the new method lies between 0.8 and 0.95, while that of the traditional method is mostly below 0.8. The cluster labels produced by the new method are therefore more readable than those of the traditional model.

Finally, the first 100 Web documents returned by Baidu for the keyword "apple" were clustered; the final result is shown in Fig. 2. As Fig. 2 shows, the proposed method achieves good clustering results.

The analysis and comparison of the experimental results show that, compared with Chinese clustering algorithms based on word segmentation, the proposed Chinese Web document online clustering method based on common substrings has clear advantages in clustering quality, clustering performance, and cluster label readability.

Description of drawings

Fig. 1 is a comparison of label validity;

Fig. 2 is the clustering result obtained for the input keyword "apple";

Fig. 3 shows the test results for three query words (apple, Yao Ming, data mining);

Fig. 4 is the flowchart of the Chinese Web document online clustering method based on common substrings.

Embodiment:

1. Web document preprocessing

The results returned by a Chinese search engine (such as Baidu) often contain non-Chinese characters, such as English letters, spaces, punctuation marks, or garbled characters. Because the invention focuses on clustering Chinese Web documents, the non-Chinese content of the search results must be replaced before clustering.

The preprocessing stage mainly replaces these non-Chinese characters with a system-defined separator. The non-Chinese characters to be replaced mainly include spaces, digits, upper- and lower-case English letters, Chinese and English punctuation marks (both full-width and half-width), and Chinese stop words (for example: " ", " ", " " etc.). After preprocessing, the search engine result items contain only Chinese characters and serve as the input to common substring extraction.
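The preprocessing step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the separator `#`, the two stop words, and the use of the CJK Unified Ideographs range U+4E00 to U+9FFF as the definition of "Chinese character" are all illustrative choices that the text does not specify.

```python
import re

SEPARATOR = "#"             # hypothetical; the text only says "system-defined"
STOP_WORDS = ("的", "了")    # hypothetical examples of Chinese stop words

# Any run of characters outside the basic CJK Unified Ideographs block
# (assumed here to be what counts as "non-Chinese").
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text, separator=SEPARATOR, stop_words=STOP_WORDS):
    """Replace stop words and non-Chinese runs with a single separator."""
    for word in stop_words:
        text = text.replace(word, separator)
    # The separator itself is non-Chinese, so adjacent replacements
    # collapse into one separator here.
    text = _NON_CHINESE.sub(separator, text)
    return text.strip(separator)


print(preprocess("苹果iPhone 15, 新品发布！"))
print(preprocess("姚明的新闻2009"))
```

Collapsing each non-Chinese run into a single separator keeps the preprocessed string short, which matters because its length drives the cost of the later suffix array construction.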

2. Common substring extraction based on GSA

● Common substring (Common Substring, CS): if a character string u is a substring of both the character string S and the character string T, then u is a common substring of S and T. If Sub(S, u) denotes that u is a substring of S, then the common substring set Com(S, T) of S and T can be defined as:

Com(S, T) = \{ u \mid Sub(S, u) \wedge Sub(T, u) \}

● Longest common substring (Longest Common Substring, LCS): the longest common substring of S and T is the substring of maximum length among all common substrings of S and T. If a character string u satisfies u ∈ Com(S, T) and

\forall v \in Com(S, T): len(v) \le len(u)

then u is called a longest common substring of S and T.

For example, take the two strings "abac" and "caba", each of length 4. Their common substrings are "", "a", "b", "ab", "ba", "aba", and "c", of which the longest is "aba".
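The two definitions can be checked directly with a brute-force Python sketch. It enumerates every non-empty substring, so it is only suitable for short strings, and it omits the empty string that the example lists; the function names are illustrative.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def common_substrings(s, t):
    """Com(S, T) restricted to non-empty strings."""
    return substrings(s) & substrings(t)


def longest_common_substrings(s, t):
    """All common substrings of maximum length (there may be several)."""
    com = common_substrings(s, t)
    if not com:
        return set()
    longest = max(len(u) for u in com)
    return {u for u in com if len(u) == longest}


print(sorted(common_substrings("abac", "caba")))
print(longest_common_substrings("abac", "caba"))
```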

The classic algorithms for the common substring problem are dynamic programming and the suffix tree algorithm. The former is easy to implement but has high time complexity; the latter has only linear time complexity but is harder to implement. This method uses the generalized suffix array (GSA) algorithm, a variant of the suffix tree (Suffix Tree) algorithm, to extract the common substrings of the texts.

With the GSA algorithm, the time complexity of finding the common substrings of the strings is O(n). The space complexity of the GSA algorithm is S(n), which is better than that of the suffix tree algorithm.

Definitions:

● Suffix (Suffix): a suffix of a character string S is the substring of S that runs from some position i (i ≤ len(S)) to the last character of S. It can be written Suffix(S, i), that is, Suffix(S, i) = subString(S, i, len(S)).

● Suffix array (Suffix Array, SA): the suffix array SA corresponds one-to-one to the character string S, and each of its elements is an index into S. That is, len(SA) = len(S), SA[i] ∈ {1, 2, ..., len(S)} (1 ≤ i ≤ len(S)), and SA[i] ≠ SA[j] (i ≠ j). The array also satisfies Suffix(S, SA[i]) < Suffix(S, SA[i+1]) (1 ≤ i < len(S)).

● Generalized suffix array (Generalized Suffix Array, GSA): the generalized suffix array of several character strings S_1, S_2, ..., S_n is the suffix array of the new string formed by joining S_1, S_2, ..., S_n with special end marks.

For example, take the two strings S_1 = "abac" and S_2 = "caba". Joining them with the special character @ gives the string abac@caba. Not counting the suffix that begins with the separator, this string has 8 non-empty suffixes; the table below lists them in their original order and after sorting in dictionary order:

Non-empty suffixes before and after sorting

The one-dimensional array SA = [8, 6, 1, 3, 7, 2, 5, 4] is then the generalized suffix array of the strings S_1 and S_2.

Once the generalized suffix array of the two joined strings has been obtained, the longest common prefixes of adjacent suffixes are compared pair by pair in order. Every longest common prefix of length greater than or equal to 1 is a common substring of the two strings.

This algorithm for two strings extends to N strings (N > 1). Given N strings S_1, S_2, ..., S_N, join them with N-1 special characters (which need not be pairwise distinct) to obtain the string SE = S_1 a_1 S_2 a_2 ... S_{N-1} a_{N-1} S_N, where a_i (1 ≤ i ≤ N-1) are the inserted special characters and, for all a_i and S_j (1 ≤ i ≤ N-1, 1 ≤ j ≤ N), a_i does not occur in S_j. Construct the suffix array of SE and then compare the longest common prefixes of adjacent suffixes pair by pair to obtain all common substrings of the N strings S_1, S_2, ..., S_N.
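The extraction procedure can be sketched in Python as follows. This is a naive illustration, not the linear-time construction the text refers to: the suffixes are simply sorted, which costs far more than O(n). Several details are assumptions made to reproduce the "abac@caba" example: suffixes that begin with the separator are skipped, the separator sorts after every other character, positions are numbered over the original characters only, and only adjacent suffixes that come from different strings are compared. All function names are illustrative.

```python
def generalized_suffix_array(strings, sep="@"):
    """Return (text, starts, owner) for the strings joined by `sep`.

    starts: 0-based start offsets of the suffixes, in sorted order
            (suffixes beginning with the separator are skipped)
    owner:  owner[i] is the index of the string that contains offset i,
            or -1 for a separator position
    """
    text = sep.join(strings)
    owner = []
    for idx, s in enumerate(strings):
        owner.extend([idx] * len(s))
        if idx < len(strings) - 1:
            owner.append(-1)
    # Assumption: the separator sorts after every other character.
    key = lambda i: text[i:].replace(sep, "\U0010FFFF")
    starts = sorted((i for i in range(len(text)) if text[i] != sep), key=key)
    return text, starts, owner


def document_numbering(text, starts, sep="@"):
    """1-based positions counted over non-separator characters only."""
    return [i + 1 - text[:i].count(sep) for i in starts]


def common_prefix(a, b, sep):
    """Longest common prefix of a and b that does not cross a separator."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n] and a[n] != sep:
        n += 1
    return a[:n]


def common_substrings_gsa(strings, sep="@"):
    """Substrings that occur in at least two of the input strings."""
    text, starts, owner = generalized_suffix_array(strings, sep)
    found = set()
    for p, q in zip(starts, starts[1:]):
        if owner[p] == owner[q]:
            continue  # both suffixes come from the same string
        lcp = common_prefix(text[p:], text[q:], sep)
        # Every non-empty prefix of a common prefix is also common.
        found.update(lcp[:n] for n in range(1, len(lcp) + 1))
    return found


text, starts, owner = generalized_suffix_array(["abac", "caba"])
print(document_numbering(text, starts))
print(sorted(common_substrings_gsa(["abac", "caba"])))
```

For real documents, the separator must be a character that cannot occur in the preprocessed text, and a production implementation would replace the sort with a dedicated suffix array construction algorithm.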

3. Building the text feature vector model

In text-based information retrieval, the feature vector model of a text is a set made up of a number of features of the text. In this common-substring-based text feature vector model, each document D can be represented as a feature vector made up of n common substrings and their weights. Assume the following:

● the texts to be clustered are {D_1, D_2, ..., D_N};

● the sequence of common substrings after filtering is (S_1, S_2, ..., S_{n-1}, S_n);

● the function len(S_k) (k = 1, 2, ..., n) denotes the length of the string S_k;

● the function tf(S_k, D_j) denotes the frequency with which the common substring S_k occurs in the text D_j; tf is the term frequency (Term frequency) commonly used in information retrieval;

● the function idf(S_k) denotes the inverse document frequency (Inverse document frequency) of the common substring S_k;

● the constant N denotes the number of results returned by the search engine, that is, the number of texts to be clustered;

● the function df(S_k) denotes the number of texts that contain the common substring S_k.

Based on these assumptions, the document D_j can be represented as the following vector:

D_j = {w(S_1, D_j), w(S_2, D_j), ..., w(S_n, D_j)}, (j = 1, 2, ..., N)

where w(S_k, D_j) (k = 1, 2, ..., n) is the weight of the common substring S_k with respect to the text D_j. The weight formula, modeled on TF*IDF, is:

w(S_{k}, D_{j}) = \log(1 + tf(S_{k}, D_{j})) \cdot idf(S_{k}) \cdot (len(S_{k}))^{\alpha}

where

idf(S_{k}) = \log(1 + \frac{N}{df(S_{k})})

Within a document, the weight of a common substring is positively correlated with its length: the longer the substring, the larger its weight. In the formula above, the length len(S_k) of the common substring S_k is raised to the power α to amplify the influence of long common substrings on the weight; the value of α has to be determined experimentally.

Once the weights of all common substrings have been computed with the formula above, a traditional similarity measure, such as cosine similarity, can be used to compute the similarity between texts. The similarity of two texts is computed as:

Sim(d_{i}, d_{j}) = \frac{d_{i} \cdot d_{j}}{| d_{i} | \cdot | d_{j} |}
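The weighting and similarity computation can be sketched in Python as follows. The sketch assumes that tf is a raw count of (possibly overlapping) occurrences, that the logarithm is natural, and that idf(S_k) = log(1 + N/df(S_k)) as stated in claim 3; the default α = 1.3 is the value the document reports from its experiments. Function names are illustrative.

```python
import math


def count_occurrences(text, sub):
    """Number of (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1)
               if text.startswith(sub, i))


def feature_vectors(docs, substrings, alpha=1.3):
    """One weight vector per document, one component per common substring."""
    n_docs = len(docs)
    vectors = [[0.0] * len(substrings) for _ in docs]
    for k, s in enumerate(substrings):
        tfs = [count_occurrences(d, s) for d in docs]
        df = sum(1 for tf in tfs if tf > 0)
        if df == 0:
            continue  # substring occurs nowhere; all weights stay 0
        idf = math.log(1 + n_docs / df)
        for j, tf in enumerate(tfs):
            vectors[j][k] = math.log(1 + tf) * idf * len(s) ** alpha
    return vectors


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["abac", "caba", "xyz"]
vecs = feature_vectors(docs, ["aba", "c"])
similarity = [[cosine(u, v) for v in vecs] for u in vecs]
print(similarity)
```

The nested list `similarity` is the full symmetric matrix; only its upper triangle is needed for clustering.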

4. Clustering method and implementation

The text feature vector model above yields the similarity matrix between texts, shown in the following table:

Similarity matrix

	D1	D2	D3	…	DN
D1		Sim(1,2)	Sim(1,3)	…	Sim(1,N)
D2			Sim(2,3)	…	Sim(2,N)
…					…
DN

In the table above, D_i (1 ≤ i ≤ N) are the N result items (the documents to be clustered), and Sim(i, j) (1 ≤ i, j ≤ N) denotes the similarity between result items D_i and D_j.

Once the similarity matrix has been obtained, hierarchical clustering can be applied to the result items. Hierarchical clustering keeps the total number of classes small, which helps users locate the information they need quickly, and each class can be subdivided further. Traditional hierarchical clustering methods (whether agglomerative or divisive) have high complexity and poor scalability, and are therefore unsuitable for clustering large numbers of documents. The invention therefore optimizes traditional agglomerative hierarchical clustering.

Suppose the total number of documents to be clustered is N, N > 0; N_i denotes the number of documents still unclassified at step i+1, i = 0, 1, ...; the set T_i denotes the i-th cluster, whose elements are the documents classified at step i+1, i = 0, 1, .... For the convenience of users browsing the results, the total number of classes after clustering is set to no more than 20. The clustering method is described in claim 4.
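The threshold-based procedure of claim 4 can be sketched in Python as follows. The first step follows the claim: the threshold is half the largest pairwise similarity, every document with a partner above the threshold is collected, and the higher-indexed document of any pair below the threshold is then removed. The formulas for the later steps are not legible in the text, so the sketch assumes that each later step recomputes the threshold over the still unclassified documents and that clustering stops when fewer than two documents remain, no pair exceeds the threshold, or 19 clusters have been formed (leaving room for the "other" class). The function name and return layout are illustrative.

```python
def cluster(sim, max_clusters=20):
    """Cluster documents from a symmetric N x N similarity matrix.

    Returns (clusters, other): a list of clusters (lists of document
    indices) and the indices left for the "other" class.
    """
    remaining = list(range(len(sim)))
    clusters = []
    while len(remaining) >= 2 and len(clusters) < max_clusters - 1:
        pairs = [(i, j) for a, i in enumerate(remaining)
                 for j in remaining[a + 1:]]
        # Assumed for every step: half the largest remaining similarity.
        theta = 0.5 * max(sim[i][j] for i, j in pairs)
        # Sub-step ii: documents that have a partner above the threshold.
        members = sorted({d for i, j in pairs if sim[i][j] > theta
                          for d in (i, j)})
        if not members:
            break
        # Sub-step iii: while some pair is below the threshold, drop the
        # document with the larger index and rescan.
        changed = True
        while changed:
            changed = False
            for a, i in enumerate(members):
                for j in members[a + 1:]:
                    if sim[i][j] < theta:
                        members.remove(j)
                        changed = True
                        break
                if changed:
                    break
        # Sub-step iv: the surviving documents form one cluster.
        clusters.append(members)
        remaining = [d for d in remaining if d not in members]
    return clusters, remaining


# Two tight groups {0, 1, 2} and {3, 4}, plus one unrelated document 5.
groups = [0, 0, 0, 1, 1, 2]
within = {0: 0.9, 1: 0.8, 2: 1.0}
sim = [[1.0 if i == j else
        within[groups[i]] if groups[i] == groups[j] else
        0.0 if 5 in (i, j) else 0.1
        for j in range(6)] for i in range(6)]
print(cluster(sim))
```

Label extraction is omitted here; under the claim, the label of each cluster would be chosen from the common substrings of its members.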

In this method, the similarity computation is a key factor in the clustering result. The invention makes the weight of a common substring depend on the α-th power of its length. If α is too large, the effect of long common substrings is exaggerated and the clustering result suffers, so the value of α must be obtained experimentally. The value of α was increased from 0 to 2 in steps of 0.1, and the 100 search results returned for each of the keywords "apple", "Yao Ming", and "data mining" were clustered. The evaluation measure is the ratio of within-cluster distance to between-cluster distance. By the definition of this ratio, clustering is best when the within-cluster distance is small and the between-cluster distance is large, so the smaller the ratio, the better the clustering result and the more appropriate the corresponding value of α. The experimental results are shown in Fig. 3.

The test results for the three keywords show that the ratio of within-cluster to between-cluster distance is smallest when α lies in the interval [1.2, 1.4]. Extensive experiments found that the mean of the α values at which the ratio is smallest is 1.3, so α is set to 1.3.

Claims

1. A Chinese Web document online clustering method based on common substrings, characterized by the following steps:

(1) extracting the common substrings of the Web documents using the generalized suffix array (Generalized Suffix Array, GSA) algorithm;

(2) building a document feature vector model from the extracted common substrings, and computing the pairwise similarity of the Web documents based on this model to obtain a similarity matrix;

(3) clustering the Web documents with an improved hierarchical clustering algorithm based on this similarity matrix;

(4) during clustering, taking the common substring with the largest weight within a cluster as the label of that cluster.

2. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that the extraction process using the GSA algorithm in step (1) is as follows: suppose there are N documents in total; each document can be regarded as a character string, giving N strings S_1, S_2, ..., S_N, where N is greater than 1; joining these strings with N-1 special characters gives the string SE = S_1 a_1 S_2 a_2 ... S_{N-1} a_{N-1} S_N, where a_i is an inserted special character and 1 ≤ i ≤ N-1; and for all a_i and S_j, a_i does not occur in S_j,

where 1 ≤ i ≤ N-1 and 1 ≤ j ≤ N; the suffix array of SE is constructed, and the longest common prefixes of adjacent suffixes are then compared pair by pair; every longest common prefix of two adjacent suffixes with length greater than or equal to 1 is a common substring of the two strings concerned; proceeding in this way yields all common substrings of S_1, S_2, ..., S_N.

3. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that the document feature vector model built in step (2) is as follows: first assume that the texts to be clustered are {D_1, D_2, ..., D_N}; the sequence of common substrings after filtering is S_1, S_2, ..., S_{n-1}, S_n; the function len(S_k) denotes the length of the string S_k, where k = 1, 2, ..., n; the function tf(S_k, D_j) denotes the frequency with which the common substring S_k occurs in the text D_j; the function idf(S_k) denotes the inverse document frequency of the common substring S_k; the constant N denotes the number of results returned by the search engine, that is, the number of texts to be clustered; the function df(S_k) denotes the number of texts containing the common substring S_k;

the feature vector model of document D_j is built as D_j = {w(S_1, D_j), w(S_2, D_j), ..., w(S_n, D_j)}, (j = 1, 2, ..., N), that is, a feature vector made up of the common substrings and their weights;

where w(S_k, D_j) is the weight of the string S_k with respect to the text D_j, k = 1, 2, ..., n:

w(S_{k}, D_{j}) = \log(1 + tf(S_{k}, D_{j})) \cdot idf(S_{k}) \cdot (len(S_{k}))^{\alpha}

where

idf(S_{k}) = \log(1 + \frac{N}{df(S_{k})});

the value of α is determined experimentally to be 1.3.

4. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that the improved hierarchical clustering algorithm in step (3) is as follows: suppose the total number of documents to be clustered is N, where N > 0; N_i denotes the number of documents still unclassified at step i+1, where i is an integer and i ≥ 0; the set T_i denotes the i-th cluster, whose elements are the documents classified at step i+1, where i is an integer and i ≥ 0; the clustering method is as follows:

The first clustering step comprises four sub-steps:

i. At this point the number of unclassified documents is N_0 = N; the initial threshold is half of the maximum similarity in the similarity matrix, that is,

\theta_{0} = \frac{1}{2} \max_{i, j = 1, 2, \ldots, N,\; i \neq j} (Sim(D_{i}, D_{j}));

ii. For any two documents D_i and D_j, if Sim(D_i, D_j) > θ_0, put D_i and D_j into the set T_0, that is, T_0 = T_0 ∪ {D_i, D_j};

iii. If T_0 contains documents whose mutual similarity is less than the threshold θ_0, that is, if for D_i, D_j ∈ T_0 with i < j there exists Sim(D_i, D_j) < θ_0, then remove from T_0 the one of D_i and D_j with the larger subscript, that is, T_0 = T_0 - {D_j}, until T_0 contains no such D_i, D_j;

iv. All elements of T_0 are now assigned to one class, and the common substring that occurs most often among these elements is taken as the label of the class; this completes the clustering step;

From the second clustering step onward, the procedure generalizes to step n as follows:

n) At step n, where n is an integer and n ≥ 2, the number of unclassified documents is

N_{n-1} = N - \sum_{i = 0}^{n - 2} | T_{i} |

where |T_i| is the number of elements of T_i; if

then take

where, for any D_i, D_j ∈ T, sub-steps ii, iii, and iv of the first clustering step are repeated to obtain T_{n-1} and complete clustering step n; clustering step n+1 then follows (by the same process as step n); until

the documents not yet classified are then assigned to one class labeled "other", and clustering is complete.

5. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that, for the convenience of users browsing the results, the total number of cluster classes in step (3) is set to no more than 20.