CN104850594A - Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data - Google Patents

Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Info

Publication number
CN104850594A
CN104850594A (application CN201510206140.9A)
Authority
CN
China
Prior art keywords
data
pointer
similarity
sequence
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510206140.9A
Other languages
Chinese (zh)
Inventor
冀俊忠
高明霞
宋辰
刘金铎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510206140.9A priority Critical patent/CN104850594A/en
Publication of CN104850594A publication Critical patent/CN104850594A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data, belonging to the technical field of data mining. The algorithm uses a two-layer loop to cluster the data. Two positioning pointers are defined in advance; one reference datum is randomly selected from the data sequence to serve as the representative of a cluster and is swapped to the rightmost position of the pending data, and at the same time a scanning pointer is defined and initialized. The pending data are then scanned, the similarity between each remaining datum and the reference datum is computed and compared with a user-supplied threshold, and the position of the datum in the sequence is adjusted according to the result of the comparison: data whose similarity exceeds the threshold are swapped to the left side of the sequence, and data whose similarity falls below the threshold are kept on the right side, which completes one partition of the data. Finally the positioning pointers are reset to locate the new pending data and control returns to the outer loop, which repeats until the whole data sequence has been clustered. The algorithm is suited to clustering spherical data and to large data sets with strict time requirements.

Description

A fast non-recursive clustering method suitable for large-scale data
Technical field
A fast clustering method suitable for large-scale data, belonging to the field of clustering research in data mining. It relates in particular to a clustering method suitable for applications with strict time requirements.
Background art
With the spread of mobile computing and the rise of the Internet of Things, massive amounts of data are being produced, especially multimedia data such as text, images, and video. As reported in "IDC Predictions 2014", the "digital universe" in 2014 — all the digital information created, replicated, and consumed in a year — was expected to keep expanding by more than 50%, reaching about 6 ZB (roughly 6 trillion gigabytes). Analyzing and mining such large data within a reasonable and acceptable time has become a major challenge for the IT field. Clustering, or cluster analysis, is frequently used for data preprocessing in data mining. It is a common form of exploratory data analysis and is widely applied in many practical fields, such as medicine (disease classification, genetic analysis), chemistry (grouping of compounds), social science (statistical classification), information retrieval (topic detection and tracking), and computer graphics (image segmentation). Data mining therefore urgently needs a clustering algorithm that can handle large-scale data and even big data, in order to serve applications that require fast processing, such as topic detection and tracking, spam filtering, and image segmentation.
Traditional clustering algorithms, such as hierarchical clustering and spectral clustering, usually need to build a similarity matrix by computing the similarity between every pair of data objects. Although such algorithms are highly accurate, their time complexity is typically at least O(N^2). For large-scale data (for example, 10^5 to 10^8 samples) or big data (more than 10^9 samples), such a time complexity is infeasible or at best hard to tolerate. Common clustering algorithms for large-scale data, whose time complexity is generally O(n) or O(n log n), include the CURE algorithm based on sampling and the BIRCH algorithm, which incrementally builds a clustering-feature (CF) tree. Because CURE represents each cluster with multiple sample points, it can handle non-spherical data and has a wide range of application; however, its running time is closely tied to the sampling ratio, choosing a suitable sampling ratio is itself a difficult problem, and the algorithm requires the user to specify the number of clusters in advance, which is also difficult for big data of unknown distribution. BIRCH saves memory and can work incrementally, so it scales well, but its clustering result depends on the input order of the data, and the number of final clusters is constrained by the control parameters of each node in the CF tree.
Summary of the invention
Based on the idea of quicksort, the present invention proposes a fast non-recursive clustering method, named NR-CAQS (Non-Recursive Clustering Algorithm based on Quicksort). The method first defines two positioning pointers, start and end, which locate the pending data. The pending data are regarded as a data sequence; one reference datum is randomly selected from the sequence to serve as the representative of a cluster and is swapped with the datum indicated by end. Two scanning pointers i and j are then defined: i marks the position of the growing cluster in the sequence, and j scans from left to right. They are initialized so that j points to the datum indicated by start and i points to the position one place to the left of j, indicating that the new cluster currently contains no elements. In the third step, the pending data are scanned, the similarity between each remaining datum and the reference datum is computed and compared with the user threshold, and the position of the datum in the sequence is adjusted according to the result: data whose similarity is greater than the threshold are swapped to the left side of the sequence, and data whose similarity is less than the threshold are left on the right side. One scan of the pending data completes one partition. Finally, the positioning pointers are reset to locate the new pending data and the above process repeats until the whole sequence has been clustered.
The fast non-recursive clustering method for large-scale data provided by the invention comprises the following concrete steps (a short code sketch follows the step list):
Step 1: input the user similarity threshold K and the initial pending data sequence D containing n data samples; K is normally the minimum similarity allowed between elements of the same cluster;
Step 2: define the pointers start and end indicating the head and the tail of the pending data sequence, and initialize start = 1 and end = n;
Step 3: randomly select one datum from the pending data sequence as the reference value and swap it with the datum indicated by end;
Step 4: define the scanning pointers i and j, and initialize j = start and i = j - 1;
Step 5: if j >= end, go to step 6; otherwise scan the sequence indicated by start and end from left to right, compute the similarity between the current datum indicated by j and the reference value according to the chosen similarity measure, and compare it with the user threshold. If the similarity is greater than or equal to the threshold, move i one step to the right, swap the data indicated by i and j, then move j one step to the right and repeat step 5; if the similarity is less than the threshold, move j one step to the right and repeat step 5;
Step 6: move i one step to the right and swap the data indicated by i and end;
Step 7: move i one step to the right; if i is less than end, reset the start pointer so that it points to the datum indicated by i and return to step 3 to continue the loop; otherwise the algorithm terminates.
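For illustration only, the following is a minimal Java sketch of steps 1 to 7; it is not the authoritative implementation of the invention. The names NrCaqs, cluster, and Similarity, the use of 0-based array indices (the steps above use 1-based positions), and the pluggable similarity interface are assumptions introduced here for readability.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NrCaqs {
    /** Similarity between two data items; the measure itself is left pluggable. */
    public interface Similarity<T> {
        double of(T a, T b);
    }

    /**
     * Partition-based clustering following steps 1-7. Each outer pass peels one
     * cluster off the left end of the pending range [start, end].
     */
    public static <T> List<List<T>> cluster(T[] d, double k, Similarity<T> sim) {
        List<List<T>> clusters = new ArrayList<>();
        Random rnd = new Random();
        int start = 0, end = d.length - 1;          // step 2 (0-based here)
        while (start <= end) {
            int pivot = start + rnd.nextInt(end - start + 1);
            swap(d, pivot, end);                    // step 3: reference value moved to end
            int i = start - 1;                      // step 4: the new cluster is empty so far
            for (int j = start; j < end; j++) {     // step 5: single left-to-right scan
                if (sim.of(d[j], d[end]) >= k) {
                    i++;
                    swap(d, i, j);                  // similar item joins the left block
                }
            }
            i++;
            swap(d, i, end);                        // step 6: reference value closes the cluster
            List<T> c = new ArrayList<>();
            for (int p = start; p <= i; p++) {
                c.add(d[p]);
            }
            clusters.add(c);
            start = i + 1;                          // step 7: remaining data become pending
        }
        return clusters;
    }

    private static <T> void swap(T[] d, int a, int b) {
        T t = d[a]; d[a] = d[b]; d[b] = t;
    }
}

Each pass of the outer while loop corresponds to one execution of steps 3 to 7 and peels exactly one cluster off the left of the pending range, which is why the method needs no recursion, unlike textbook quicksort.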
Brief description of the drawings
Fig. 1 illustrates the flow of the method of the present invention:
1-(a) define the head and tail pointers of the pending data and initialize them;
1-(b) select the reference value and define and initialize the scanning pointers;
1-(c) scan from left to right without exchanging data;
1-(d) the first data exchange during the left-to-right scan;
1-(e) the second data exchange during the left-to-right scan;
1-(f) exchange the reference value into its final position;
1-(g) redefine the pending data sequence by resetting the start value.
Fig. 2 compares the accuracy of the four algorithms on a small data set; from left to right the algorithms are CURE, BIRCH, K-means, and NR-CAQS (the present invention).
Embodiment
The present invention is described in detail below with reference to the drawings and a specific embodiment.
In the following example the data sequence is D = {d1, d2, d3, d4, d5, d6, d7, d8, d9}. It is known that there are four clusters, C = {C_1 = {d1, d3, d5}, C_2 = {d2, d6}, C_3 = {d4, d9}, C_4 = {d7, d8}}, that the similarity between data within a cluster is at least 0.8, and that the similarity between data from different clusters is below 0.8. To obtain the correct clustering result, the similarity threshold entered in this run is set to 0.8. The non-recursive clustering method based on quicksort is applied to this data sequence as follows:
Step 1: input the user similarity threshold K = 0.8 and the initial pending data sequence D containing 9 data samples;
Step 2: define the pointers start and end indicating the head and tail of the pending data sequence and initialize start = 1, end = 9, as in Fig. 1-(a);
Step 3: randomly select one datum (here d1) from the pending data sequence as the reference value and swap it with the datum indicated by end, as in Fig. 1-(b);
Step 4: define the scanning pointers i and j and initialize j = start, with i pointing one position to the left of j, as in Fig. 1-(b);
Step 5: as in Fig. 1-(c), because j < end, the sequence indicated by start and end is scanned from left to right; the similarity between the current datum indicated by j and the reference value is computed with the chosen similarity measure and compared with the user threshold. Since the similarity of (d1, d9) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity of the current datum and the reference value is computed and compared with the threshold; since the similarity of (d1, d2) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: as in Fig. 1-(d), because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d3) is greater than or equal to the threshold, i moves one step to the right, the data indicated by i and j are swapped, then j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d4) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: as in Fig. 1-(e), because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d5) is greater than or equal to the threshold, i moves one step to the right, the data indicated by i and j are swapped, then j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d6) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d7) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d8) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j>=end, then perform step 6;
Step 6: as Fig. 1-(f), i move right 1, exchanges the data that pointer i and end specifies;
Step 7: as Fig. 1-(g), i move right 1 because i be less than end then reset start make it point to i and return step 3 continue circulation.
So far, by the first bunch of C_1={d1, d3, d5} have split, and have reassigned pending data sequence end to end.
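To connect the embodiment above with the sketch given after step 7, the following hypothetical driver applies it to the nine items d1 to d9. The embodiment only states that within-cluster similarity is at least 0.8 and between-cluster similarity is below 0.8, so the concrete values 0.9 and 0.5 used below are assumptions chosen merely to satisfy those constraints; the class name NrCaqsExample is likewise illustrative, and the NrCaqs sketch above is assumed to be on the classpath.

import java.util.List;
import java.util.Map;

public class NrCaqsExample {
    public static void main(String[] args) {
        String[] d = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"};
        // Cluster labels taken from the embodiment:
        // C_1={d1,d3,d5}, C_2={d2,d6}, C_3={d4,d9}, C_4={d7,d8}.
        Map<String, Integer> label = Map.of(
                "d1", 1, "d3", 1, "d5", 1,
                "d2", 2, "d6", 2,
                "d4", 3, "d9", 3,
                "d7", 4, "d8", 4);
        // Assumed similarity: 0.9 within a cluster, 0.5 across clusters
        // (only the comparison against the threshold 0.8 matters).
        NrCaqs.Similarity<String> sim =
                (a, b) -> label.get(a).equals(label.get(b)) ? 0.9 : 0.5;
        List<List<String>> clusters = NrCaqs.cluster(d, 0.8, sim);
        System.out.println(clusters);  // e.g. [[d3, d5, d1], [d6, d2], [d9, d4], [d8, d7]], up to order
    }
}

Because the reference datum of each pass is chosen at random, the order of the clusters and of the elements inside each cluster varies from run to run; the grouping itself does not.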
To evaluate the time performance of the proposed fast clustering method, three classic clustering algorithms, K-means, CURE, and BIRCH, were selected as comparison algorithms and a series of experiments was carried out. Every reported result is the mean of 5 random runs.
Experimental environment: Intel(R) Core(TM)2 Quad CPU 2.4 GHz, 2 GB memory; the programs were implemented in Java, and the similarity computation is based on the common Euclidean distance.
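The text states only that Euclidean distance underlies the similarity computation; it does not say how a distance is converted into a similarity comparable with the threshold K. The sketch below therefore plugs a Euclidean measure into the Similarity interface assumed earlier and uses the common, here merely assumed, transform sim = 1 / (1 + dist).

public final class EuclideanSimilarity implements NrCaqs.Similarity<double[]> {
    @Override
    public double of(double[] a, double[] b) {
        double sum = 0.0;
        for (int f = 0; f < a.length; f++) {        // 30 features per datum in the experiments
            double diff = a[f] - b[f];
            sum += diff * diff;
        }
        double dist = Math.sqrt(sum);
        return 1.0 / (1.0 + dist);                  // assumed distance-to-similarity transform
    }
}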
Experimental data: vector data sets of varying sizes were generated randomly by computer; the random vectors simulate real spherical data in 10 clusters, each datum has 30 features, and each feature is an integer in [0, 20]. To make the 10 clusters distinguishable, with high similarity within a cluster and low similarity between clusters, the 30 features are divided into 10 groups, and the 3 features of one group are treated as the decisive or dominant features of the corresponding cluster, distinguishing it from the other clusters. The dominant features of each cluster take values in Random(10, 20), and the remaining features take values in Random(0, 10).
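A sketch of the data-generation scheme just described, under two stated assumptions: the samples are assigned to the 10 clusters round-robin (the text does not say how many samples each cluster receives), and Random(a, b) is read as the half-open integer range [a, b) in the style of java.util.Random (the text does not state whether the bounds are inclusive). The class name SyntheticSpherical is illustrative.

import java.util.Random;

public class SyntheticSpherical {
    /** Generate n data, assigned round-robin to 10 clusters, 30 integer-valued features each. */
    public static double[][] generate(int n, long seed) {
        Random rnd = new Random(seed);
        double[][] data = new double[n][30];
        for (int s = 0; s < n; s++) {
            int cluster = s % 10;                       // 10 clusters in total
            for (int f = 0; f < 30; f++) {
                boolean dominant = (f / 3) == cluster;  // feature group f/3 is decisive for its cluster
                data[s][f] = dominant
                        ? 10 + rnd.nextInt(10)          // dominant features: Random(10, 20)
                        : rnd.nextInt(10);              // remaining features: Random(0, 10)
            }
        }
        return data;
    }
}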
Experimental results: Fig. 2 compares the accuracy of the four algorithms on a small set of 10,000 samples, and Table 1 compares the time taken by the four algorithms on data sets of different sizes. The parameter settings of each algorithm in the experiments are as follows: for CURE, the random sample is 10% of the total sample count and 5 representative points are selected; the number of clusters K for K-means is set to 10; the three parameters of BIRCH, namely L (the number of minimal sub-clusters in a leaf node), T (the maximum radius of a sub-cluster in a leaf node), and B (the maximum number of children of each non-leaf node), are set to L = 30, T = 5, B = 10 respectively; the similarity threshold of the algorithm of the present invention is K = 0.8.
Because the cluster assignment of the artificial data is known, the Rand index, commonly used in cluster evaluation, is chosen to measure accuracy. The experimental data are approximately spherical, so all four algorithms can handle them. As can be seen from Fig. 2, on such a small data set the accuracy of all four algorithms is relatively high; even the worst, BIRCH, reaches 85%, while the method of the present invention is comparable to CURE in accuracy, staying around 94%.
As can be seen from Table 1, when the same similarity measure is used, the time efficiency of the present invention is far higher than that of the other methods. The algorithm is essentially based on the idea of quicksort: each scan costs at most N similarity computations, one scan is needed per cluster, and the number of clusters is normally far smaller than the number of samples, so the approximate time complexity can be regarded as O(N).
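Written out, with k clusters, N samples, and N_c the number of pending data at the beginning of pass c (so N_1 = N and the N_c shrink as clusters are peeled off), the argument above amounts to:

T(N) = \sum_{c=1}^{k} (N_c - 1) \;\le\; k \cdot N, \qquad \text{hence } T(N) = O(kN) \approx O(N) \text{ for } k \ll N.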
Table 1. Time comparison of the algorithms on data sets of different scales
Finally, it should be noted that the above example is intended only to illustrate the present invention and not to limit the technical solution described herein. Therefore, although this specification has described the present invention in detail with reference to the above example, those of ordinary skill in the art should understand that the present invention may still be modified or equivalently replaced, and all technical solutions and improvements that do not depart from the spirit and scope of the invention shall be covered by the claims of the present invention.

Claims (1)

1. A fast non-recursive clustering method suitable for large-scale data, characterized in that the steps are as follows:
Step 1: input the user similarity threshold K and the initial pending data sequence D containing n data samples; K is normally the minimum similarity allowed between elements of the same cluster;
Step 2: define the pointers start and end indicating the head and the tail of the pending data sequence, and initialize start = 1 and end = n;
Step 3: randomly select one datum from the pending data sequence as the reference value and swap it with the datum indicated by end;
Step 4: define the scanning pointers i and j, and initialize j = start and i = j - 1;
Step 5: if j >= end, go to step 6; otherwise scan the sequence indicated by start and end from left to right, compute the similarity between the current datum indicated by j and the reference value according to the chosen similarity measure, and compare it with the user threshold; if the similarity is greater than or equal to the threshold, move i one step to the right, swap the data indicated by i and j, then move j one step to the right and repeat step 5; if the similarity is less than the threshold, move j one step to the right and repeat step 5;
Step 6: move i one step to the right and swap the data indicated by i and end;
Step 7: move i one step to the right; if i is less than end, reset the start pointer so that it points to the datum indicated by i and return to step 3 to continue the loop; otherwise the algorithm terminates.
CN201510206140.9A 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data Pending CN104850594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510206140.9A CN104850594A (en) 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510206140.9A CN104850594A (en) 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Publications (1)

Publication Number Publication Date
CN104850594A true CN104850594A (en) 2015-08-19

Family

ID=53850239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510206140.9A Pending CN104850594A (en) 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Country Status (1)

Country Link
CN (1) CN104850594A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183567A (en) * 2019-07-05 2021-01-05 浙江宇视科技有限公司 Optimization method, device, equipment and storage medium of BIRCH algorithm
CN112183567B (en) * 2019-07-05 2024-02-06 浙江宇视科技有限公司 BIRCH algorithm optimization method, device, equipment and storage medium
CN110909824A (en) * 2019-12-09 2020-03-24 天津开心生活科技有限公司 Test data checking method and device, storage medium and electronic equipment
CN110909824B (en) * 2019-12-09 2022-10-28 天津开心生活科技有限公司 Test data checking method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN106777038B (en) A kind of ultralow complexity image search method retaining Hash based on sequence
US9330341B2 (en) Image index generation based on similarities of image features
KR102305568B1 (en) Finding k extreme values in constant processing time
WO2016062044A1 (en) Model parameter training method, device and system
CN111125469B (en) User clustering method and device of social network and computer equipment
TWI464604B (en) Data clustering method and device, data processing apparatus and image processing apparatus
CN110188225A (en) A kind of image search method based on sequence study and polynary loss
CN108427745A (en) The image search method of visual dictionary and adaptive soft distribution based on optimization
CN107832456A (en) A kind of parallel KNN file classification methods based on the division of critical Value Data
JP2020135892A (en) Error correction method, apparatus, and computer-readable medium
CN109492682A (en) A kind of multi-branched random forest data classification method
Golge et al. Conceptmap: Mining noisy web data for concept learning
Valem et al. Unsupervised similarity learning through Cartesian product of ranking references
Nayini et al. A novel threshold-based clustering method to solve K-means weaknesses
CN104794215A (en) Fast recursive clustering method suitable for large-scale data
CN104850594A (en) Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data
CN109919057B (en) Multi-mode fusion gesture recognition method based on efficient convolutional neural network
CN110209895B (en) Vector retrieval method, device and equipment
WO2015109781A1 (en) Method and device for determining parameter of statistical model on the basis of expectation maximization
Liu et al. Learning distilled graph for large-scale social network data clustering
CN109885685A (en) Method, apparatus, equipment and the storage medium of information data processing
CN105718950B (en) A kind of semi-supervised multi-angle of view clustering method based on structural constraint
CN108090514B (en) Infrared image identification method based on two-stage density clustering
Aleb et al. An improved K-means algorithm for DNA sequence clustering
CN109977787A A kind of Human bodys' response method of multi-angle of view

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150819