CN104794215A - Fast recursive clustering method suitable for large-scale data - Google Patents

Fast recursive clustering method suitable for large-scale data

Info

Publication number
CN104794215A
Authority
CN
China
Prior art keywords
data
similarity
pointer
scan
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510206141.3A
Other languages
Chinese (zh)
Inventor
冀俊忠
高明霞
宋辰
刘金铎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510206141.3A priority Critical patent/CN104794215A/en
Publication of CN104794215A publication Critical patent/CN104794215A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fast recursive clustering method suitable for large-scale data, belonging to the technical field of data mining. The recursive clustering method treats the data set to be processed as a data sequence. First, a reference datum is selected at random from the sequence to serve as the representative datum of a cluster. Then, in a bidirectional scan of the data sequence, the similarity between each remaining datum and the reference datum is computed and compared with a user threshold, and the position of each remaining datum in the sequence is adjusted according to the result, following the principle that data whose similarity exceeds the user threshold are exchanged to the right side of the reference datum and data whose similarity falls below the user threshold are exchanged to its left side. With the reference datum as the boundary, each scan completes one partition of the data: the data on the right of the reference datum form one cluster, the data on its left are set as a new data sequence, and bidirectional scanning continues so that the comparison and partition proceed recursively. The method is suitable for clustering large-scale data sets when running time is critical.

Description

A fast recursive clustering method suitable for large-scale data
Technical field
The invention is a fast clustering method suitable for large-scale data and belongs to the field of clustering in data mining. It relates in particular to a clustering method suited to applications with demanding time requirements.
Background art
With the spread of mobile computing and the rise of the Internet of Things, massive amounts of data have been generated, especially multimedia data such as text, images and video. According to IDC's "Predictions 2014", the "digital universe", that is, all digital information created, copied and consumed in one year, was expected to keep expanding by more than 50% in 2014, reaching roughly 6 ZB (6 trillion gigabytes). Analyzing and mining such big data within a reasonable and acceptable time has become a major challenge for the IT field. Clustering, or cluster analysis, is often used for data preprocessing in data mining; it is a common form of exploratory data analysis and is widely applied in many practical fields, such as medicine (disease classification, gene analysis), chemistry (grouping of compounds), social science (statistical classification), information retrieval (topic detection and tracking) and computer graphics (image segmentation). The data mining field therefore urgently needs a clustering algorithm that can handle large-scale data or even big data, in order to serve applications that demand fast processing, such as topic detection and tracking, spam filtering and image segmentation.
Traditional clustering algorithms, such as hierarchical clustering and spectral clustering, usually need to build a similarity matrix by computing the similarity between every pair of data objects. Although such algorithms achieve high accuracy, their time complexity is typically at least O(N^2). For large-scale data (say, 10^5 to 10^8 samples) or big data (more than 10^9 samples), such complexity is impractical or hard to tolerate. Clustering algorithms commonly used for large-scale data generally have a time complexity of O(n) or O(n log n); examples are the sampling-based CURE algorithm and the BIRCH algorithm, which builds a clustering feature (CF) tree incrementally. Because CURE represents each cluster by several sample points, it can handle non-spherical data and is widely applicable, but its running time is closely tied to the sampling ratio, choosing a suitable sampling ratio is itself a hard problem, and the algorithm requires the user to specify the number of clusters in advance, which is difficult for big data of unknown distribution. BIRCH saves memory and can work incrementally, so it scales well, but its result depends on the input order of the data, and the number of final clusters is limited by the control parameters of each node in the CF tree.
Summary of the invention
Based on the idea of quicksort, the present invention proposes a fast recursive clustering method, named CAQS (Clustering Algorithm based on Quicksort). CAQS is a typical recursive method. It treats the data set to be processed as a data sequence. First, a reference datum is selected at random from the sequence to serve as the representative datum of a cluster. Then, in a bidirectional scan of the data sequence, the similarity between each remaining datum and the reference datum is computed and compared with a user threshold, and the position of each remaining datum in the sequence is adjusted according to the result, following the rule that data whose similarity is greater than the user threshold are exchanged to the right side of the reference datum and data whose similarity is less than the user threshold are exchanged to its left side. With the reference datum as the boundary, each scan completes one partition of the data: the reference datum together with the data on its right is taken as one cluster, the data on its left are set as a new data sequence, and bidirectional scanning continues to complete the comparison and partition recursively.
The fast recursive clustering method for large-scale data provided by the invention proceeds by the following concrete steps (a code sketch of these steps is given after step 6):
Step 1: Input the user similarity threshold K and the data sequence D to be processed; a reference value for K is normally the minimum similarity between elements within a cluster;
Step 2: Define pointers i and j for the bidirectional scan of the pending data sequence, and point i and j to the leftmost and rightmost data of D, respectively;
Step 3: Randomly select a datum from D as the reference datum and exchange it with the leftmost datum in the sequence;
Step 4: Scan from right to left. Compute the similarity between the datum currently pointed to by j and the reference datum and compare it with K. If the similarity is less than the user threshold, exchange the data pointed to by i and j and move pointer i one step to the right, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 5. If the similarity is greater than or equal to the user threshold, move pointer j one step to the left and check the pointers: if i >= j, go to step 6, otherwise continue with step 4.
Step 5: Scan from left to right. Compute, using the chosen similarity measure, the similarity between the datum currently pointed to by i and the reference datum. If the similarity is greater than or equal to the user threshold, exchange the data pointed to by i and j and move pointer j one step to the left, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 4. If the similarity is less than the user threshold, move pointer i one step to the right and check the pointers: if i >= j, go to step 6, otherwise continue with step 5.
Step 6: Delimit the reference datum and the data on its right as a new cluster. Assign the data on the left of the reference datum as a new data sequence D and return to step 1, continuing the recursion until the number of data in the pending sequence D is less than or equal to 1.
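The following is a minimal, self-contained sketch of steps 1-6 in Java (the language used for the experiments described later). The class name CaqsSketch, the generic element type and the BiFunction-based similarity interface are illustrative assumptions; the patent specifies only the procedure, not an API. The recursion of step 6 is written as a loop over the shrinking left part, and a leftover single datum is kept as its own cluster, a detail the text leaves open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.BiFunction;

public class CaqsSketch<T> {

    private final double threshold;                      // user similarity threshold K
    private final BiFunction<T, T, Double> similarity;   // chosen similarity measure
    private final Random random = new Random();

    public CaqsSketch(double threshold, BiFunction<T, T, Double> similarity) {
        this.threshold = threshold;
        this.similarity = similarity;
    }

    /** Clusters the sequence in place; each returned sub-list is one cluster. */
    public List<List<T>> cluster(List<T> data) {
        List<List<T>> clusters = new ArrayList<>();
        int right = data.size() - 1;
        // Step 6, expressed iteratively: keep partitioning the left remainder
        // until at most one datum is pending.
        while (right >= 1) {
            int split = partition(data, 0, right);                  // one bidirectional scan
            clusters.add(new ArrayList<>(data.subList(split, right + 1)));
            right = split - 1;                                      // left side becomes the new sequence
        }
        if (right == 0) {                                           // leftover singleton
            clusters.add(new ArrayList<>(data.subList(0, 1)));
        }
        return clusters;
    }

    /**
     * One bidirectional scan (steps 2-5): data with similarity >= K to the
     * reference datum end up on its right, the rest on its left. Returns the
     * index of the reference datum after the scan.
     */
    private int partition(List<T> data, int left, int right) {
        Collections.swap(data, left, left + random.nextInt(right - left + 1)); // step 3
        T reference = data.get(left);
        int i = left, j = right;
        boolean rightToLeft = true;                                 // step 4 runs first
        while (i < j) {
            if (rightToLeft) {                                      // step 4
                if (similarity.apply(reference, data.get(j)) < threshold) {
                    Collections.swap(data, i, j);                   // dissimilar datum moves left
                    i++;
                    rightToLeft = false;                            // switch to step 5
                } else {
                    j--;
                }
            } else {                                                // step 5
                if (similarity.apply(reference, data.get(i)) >= threshold) {
                    Collections.swap(data, i, j);                   // similar datum moves right
                    j--;
                    rightToLeft = true;                             // switch back to step 4
                } else {
                    i++;
                }
            }
        }
        return i;                                                   // the reference datum ends up where i and j meet
    }
}
```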
Brief description of the drawings
Fig. 1 is the process flow diagram of the present invention:
1-(a) the initial data sequence and the related initial values;
1-(b) the first data exchange during the right-to-left scan, after which the scan direction is switched;
1-(c) the first data exchange during the left-to-right scan, after which the scan direction is switched;
1-(d) the second data exchange during the right-to-left scan, after which the scan direction is switched;
1-(e) the second data exchange during the left-to-right scan, after which the scan direction is switched;
1-(f) the third data exchange during the right-to-left scan, after which the scan direction is switched;
1-(g) the first recursive scan ends, producing the new cluster C_1 = {d1, d3, d5};
Fig. 2 compares the accuracy of the four algorithms; from left to right: CURE, BIRCH, K-means, CAQS (the present invention).
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and a specific embodiment.
In the following example, the data sequence is D = {d1, d2, d3, d4, d5, d6, d7, d8, d9}. It is known that there are four clusters, C = {C_1 = {d1, d3, d5}, C_2 = {d2, d6}, C_3 = {d4, d9}, C_4 = {d7, d8}}, that the similarity between data within a cluster is greater than or equal to 0.8, and that the similarity between data from different clusters is less than 0.8. To obtain the correct clustering result, the similarity threshold input in this run is set to 0.8. The quicksort-based recursive clustering method is applied to this data sequence as follows:
Step 1: Input K = 0.8 and the pending data sequence D;
Step 2: Define pointers i and j for the bidirectional scan of the pending data sequence, as shown in Fig. 1-(a), and point i and j to the leftmost and rightmost data of the sequence, respectively;
Step 3: Randomly select datum d1 as the reference datum, as shown in Fig. 1-(a); since it is already at the leftmost position, no exchange is needed;
Step 4: Scan from right to left and compute, using the chosen similarity measure, the similarity between the datum pointed to by j and the reference datum, comparing it with K. Because the similarity of (d1, d9) is below the threshold, the first data exchange takes place, as shown in Fig. 1-(b), and pointer i moves one step to the right; checking the pointers, i < j, so the scan direction is switched and step 5 is executed;
Step 5: Scan from left to right and compute, using the chosen similarity measure, the similarity between the datum pointed to by i and the reference datum. Because the similarity of (d1, d2) is below the threshold, pointer i moves one step to the right; checking the pointers, i < j, so step 5 continues. Because the similarity of (d1, d3) is greater than or equal to the threshold, the second data exchange takes place, as shown in Fig. 1-(c), and pointer j moves one step to the left; checking the pointers, i < j, so the scan direction is switched and step 4 is executed;
Step 4: Scan from right to left and compute the similarity between the datum pointed to by j and the reference datum, comparing it with K. Because the similarity of (d1, d8) is below the threshold, the third data exchange takes place, as shown in Fig. 1-(d), and pointer i moves one step to the right; checking the pointers, i < j, so the scan direction is switched and step 5 is executed;
Step 5: Scan from left to right and compute the similarity between the datum pointed to by i and the reference datum. Because the similarity of (d1, d5) is greater than or equal to the threshold, the fourth data exchange takes place, as shown in Fig. 1-(e), and pointer j moves one step to the left; checking the pointers, i < j, so the scan direction is switched and step 4 is executed;
Step 4: Scan from right to left and compute the similarity between the datum pointed to by j and the reference datum, comparing it with K. Because the similarity of (d1, d7) is below the threshold, the fifth data exchange takes place, as shown in Fig. 1-(f), and pointer i moves one step to the right; checking the pointers, i = j, so step 6 is executed;
Step 6: Delimit the reference datum and the data on its right as the new cluster C_1 = {d1, d3, d5}; assign the data on the left as a new data sequence D and execute steps 1-6 recursively until the number of remaining data is less than or equal to 1.
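As a hypothetical driver for this worked example, the CaqsSketch class given after the steps in the summary can be run with a lookup-based similarity that returns 0.9 for data in the same known cluster and 0.3 otherwise; the two values are illustrative stand-ins for the assumption that within-cluster similarity is at least 0.8 and between-cluster similarity is below 0.8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CaqsExample {
    public static void main(String[] args) {
        List<String> d = new ArrayList<>(List.of(
                "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"));
        // Known cluster memberships C_1..C_4 from the example above.
        Map<String, Integer> knownCluster = Map.of(
                "d1", 1, "d3", 1, "d5", 1,
                "d2", 2, "d6", 2,
                "d4", 3, "d9", 3,
                "d7", 4, "d8", 4);
        CaqsSketch<String> caqs = new CaqsSketch<>(0.8,
                (a, b) -> knownCluster.get(a).equals(knownCluster.get(b)) ? 0.9 : 0.3);
        // The reference datum of each scan is chosen at random, so the order of
        // the clusters and of the data inside each cluster varies between runs,
        // but the grouping always matches C_1..C_4.
        System.out.println(caqs.cluster(d));
    }
}
```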
To evaluate the accuracy and running time of the proposed fast clustering method, three classic clustering algorithms, K-means, CURE and BIRCH, were selected for comparison in a series of experiments; each reported result is the mean of 5 random runs.
Experimental conditions: Intel(R) Core(TM) 2 Quad CPU at 2.4 GHz with 2 GB of memory; the programs were implemented in Java, and the similarity measure used was the common Euclidean distance.
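The text does not state how the Euclidean distance is turned into a similarity that can be compared with a threshold such as K = 0.8; one common mapping, used here purely as an assumption, is sim(a, b) = 1 / (1 + dist(a, b)):

```java
import java.util.function.BiFunction;

public final class EuclideanSimilarity {

    /** Hypothetical adapter: Euclidean distance mapped into a similarity in (0, 1]. */
    public static final BiFunction<double[], double[], Double> SIM = (a, b) -> {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));   // distance 0 gives similarity 1
    };

    private EuclideanSimilarity() { }
}
```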
Experimental data: vector data sets of varying sizes generated randomly by computer. The random vectors simulate realistic spherical data with 10 clusters in total; each datum has 30 features, and each feature is an integer in the range [0, 20]. To make the 10 clusters distinguishable, following the principle that within-cluster similarity should be high and between-cluster similarity low, the 30 features are divided into 10 groups; the 3 features of each group can be regarded as the decisive or dominant features that distinguish the corresponding cluster from the others. The dominant features of each cluster take values in Random(10, 20), and the remaining features take values in Random(0, 10).
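A sketch of this data generator follows; the class name, method signature and the exact random calls are assumptions, since the description fixes only the ranges (dominant features of cluster c drawn from Random(10, 20), all other features from Random(0, 10)).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SyntheticData {

    /** Generates samplesPerCluster vectors for each of the 10 simulated clusters. */
    public static List<double[]> generate(int samplesPerCluster, Random rnd) {
        List<double[]> data = new ArrayList<>();
        for (int c = 0; c < 10; c++) {
            for (int s = 0; s < samplesPerCluster; s++) {
                double[] v = new double[30];
                for (int f = 0; f < 30; f++) {
                    boolean dominant = (f / 3 == c);          // features 3c..3c+2 belong to cluster c
                    v[f] = dominant ? 10 + rnd.nextInt(11)    // dominant feature: integer in [10, 20]
                                    : rnd.nextInt(10);        // other feature: integer in [0, 10)
                }
                data.add(v);
            }
        }
        return data;
    }
}
```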
Experimental results: Fig. 2 compares the accuracy of the four algorithms on a small data set of 10,000 samples, and Table 1 compares the time consumed by the four algorithms on data sets of different scales. The parameters of each algorithm were set as follows: for CURE, the random sample is 10% of the total number of samples and 5 representative points are used; the number of clusters K for K-means is set to 10; the three parameters of BIRCH, namely L (the number of minimal sub-clusters in a leaf node), T (the maximum radius of a sub-cluster in a leaf node) and B (the maximum number of children of each non-leaf node), are set to L = 30, T = 5 and B = 10, respectively; the similarity threshold of the method in the present invention is K = 0.8.
Because the cluster distribution of the artificial data is known, the Rand index commonly used in clustering evaluation is chosen as the accuracy measure. The experimental data are roughly spherical, so all four algorithms can process them. As can be seen from Fig. 2, the accuracy of all four algorithms on such a small data set is relatively high: even the worst, BIRCH, reaches 85%, and the method of the present invention is comparable to CURE, staying at about 94%.
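For reference, the Rand index counts the pairs on which the produced clustering and the known clustering agree. With a the number of pairs placed in the same cluster by both, b the number of pairs placed in different clusters by both, and n the number of samples:

$$ \mathrm{RI} = \frac{a + b}{\binom{n}{2}} $$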
As can be seen from Table 1, when the same similarity measure is used, the time efficiency of the present invention is far higher than that of the other methods. The algorithm is essentially based on the idea of quicksort: the time cost of each scan is at most N, one scan is needed for each cluster produced, and the number of clusters is in general far smaller than the number of samples, so the approximate time complexity can be regarded as O(N).
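Written as a bound (a sketch of the argument above, with N_i denoting the length of the sequence processed by the i-th scan, N_1 = N, and k the number of clusters produced):

$$ T(N) = \sum_{i=1}^{k} N_i \le k \cdot N , $$

which is effectively O(N) whenever k is much smaller than N.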
Table 1. Time comparison of the algorithms
Finally, it should be noted that the above example serves only to illustrate the present invention and does not limit the technical solution described herein. Therefore, although this specification has described the present invention in detail with reference to the above example, those of ordinary skill in the art should understand that the present invention may still be modified or equivalently substituted, and all technical solutions and improvements that do not depart from the spirit and scope of the invention shall fall within the scope of the claims of the present invention.

Claims (1)

1. A fast recursive clustering method suitable for large-scale data, characterized in that the steps are as follows:
Step 1: Input the user similarity threshold K and the data sequence D to be processed, where K is the minimum similarity between elements within the same cluster;
Step 2: Define pointers i and j for the bidirectional scan of the pending data sequence, and point i and j to the leftmost and rightmost data of D, respectively;
Step 3: Randomly select a datum from D as the reference datum and exchange it with the leftmost datum in the sequence;
Step 4: Scan from right to left. Compute, using the chosen similarity measure, the similarity between the datum currently pointed to by j and the reference datum and compare it with K. If the similarity is less than the user threshold, exchange the data pointed to by i and j and move pointer i one step to the right, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 5. If the similarity is greater than or equal to the user threshold, move pointer j one step to the left and check the pointers: if i >= j, go to step 6, otherwise continue with step 4;
Step 5: Scan from left to right. Compute, using the chosen similarity measure, the similarity between the datum currently pointed to by i and the reference datum. If the similarity is greater than or equal to the user threshold, exchange the data pointed to by i and j and move pointer j one step to the left, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 4. If the similarity is less than the user threshold, move pointer i one step to the right and check the pointers: if i >= j, go to step 6, otherwise continue with step 5;
Step 6: Delimit the reference datum and the data on its right as a new cluster. Assign the data on the left of the reference datum as a new data sequence D and return to step 1, continuing the recursion until the number of data in the pending sequence D is less than or equal to 1.
CN201510206141.3A 2015-04-27 2015-04-27 Fast recursive clustering method suitable for large-scale data Pending CN104794215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510206141.3A CN104794215A (en) 2015-04-27 2015-04-27 Fast recursive clustering method suitable for large-scale data


Publications (1)

Publication Number Publication Date
CN104794215A true CN104794215A (en) 2015-07-22

Family

ID=53559007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510206141.3A Pending CN104794215A (en) 2015-04-27 2015-04-27 Fast recursive clustering method suitable for large-scale data

Country Status (1)

Country Link
CN (1) CN104794215A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288045A (en) * 2018-01-31 2018-07-17 天讯瑞达通信技术有限公司 A kind of mobile video live streaming/monitor video acquisition source tagsort method
CN108288045B (en) * 2018-01-31 2020-11-24 天讯瑞达通信技术有限公司 Mobile video live broadcast/monitoring video acquisition source feature classification method
CN109447186A (en) * 2018-12-13 2019-03-08 深圳云天励飞技术有限公司 Clustering method and Related product
CN112200206A (en) * 2019-07-08 2021-01-08 浙江宇视科技有限公司 BIRCH algorithm improvement method, device and equipment based on distributed platform
CN112200206B (en) * 2019-07-08 2024-02-27 浙江宇视科技有限公司 BIRCH algorithm improvement method, device and equipment based on distributed platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150722