CN104794215A - Fast recursive clustering method suitable for large-scale data - Google Patents

Fast recursive clustering method suitable for large-scale data

Info

Publication number
CN104794215A
Authority
CN
China
Prior art keywords
data
similarity
pointer
scan
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510206141.3A
Other languages
Chinese (zh)
Inventor
冀俊忠
高明霞
宋辰
刘金铎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510206141.3A priority Critical patent/CN104794215A/en
Publication of CN104794215A publication Critical patent/CN104794215A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fast recursive clustering method suitable for large-scale data, belonging to the technical field of data mining. The recursive clustering method treats the data set to be processed as a data sequence. First, a reference datum is selected at random from the sequence to serve as the representative datum of a cluster. Then, in a bidirectional scan of the data sequence, the similarity between each remaining datum and the reference datum is computed and compared with a user threshold, and the position of each remaining datum in the sequence is adjusted according to the result, following the principle that data whose similarity exceeds the user threshold are exchanged to the right side of the reference datum and data whose similarity falls below the user threshold are exchanged to its left side. With the reference datum as the boundary, each scan completes one partition of the data: the data on the right of the reference datum form one cluster, the data on its left are set as a new data sequence, and bidirectional scanning continues so that the comparison and partition proceed recursively. The method is suitable for clustering large-scale data sets when running time is critical.

Description

A fast recursive clustering method suitable for large-scale data
Technical field
The invention is a fast clustering method suitable for large-scale data and belongs to the field of clustering in data mining. It relates in particular to a clustering method suited to applications with demanding time requirements.
Background art
With the spread of mobile computing and the rise of the Internet of Things, massive amounts of data have been generated, especially multimedia data such as text, images and video. According to IDC's "Predictions 2014", the "digital universe", that is, all digital information created, copied and consumed in one year, was expected to keep expanding by more than 50% in 2014, reaching roughly 6 ZB (6 trillion gigabytes). Analyzing and mining such big data within a reasonable and acceptable time has become a major challenge for the IT field. Clustering, or cluster analysis, is often used for data preprocessing in data mining; it is a common form of exploratory data analysis and is widely applied in many practical fields, such as medicine (disease classification, gene analysis), chemistry (grouping of compounds), social science (statistical classification), information retrieval (topic detection and tracking) and computer graphics (image segmentation). The data mining field therefore urgently needs a clustering algorithm that can handle large-scale data or even big data, in order to serve applications that demand fast processing, such as topic detection and tracking, spam filtering and image segmentation.
Traditional clustering algorithms, such as hierarchical clustering and spectral clustering, usually need to build a similarity matrix by computing the similarity between every pair of data objects. Although such algorithms achieve high accuracy, their time complexity is typically at least O(N^2). For large-scale data (say, 10^5 to 10^8 samples) or big data (more than 10^9 samples), such complexity is impractical or hard to tolerate. Clustering algorithms commonly used for large-scale data generally have a time complexity of O(n) or O(n log n); examples are the sampling-based CURE algorithm and the BIRCH algorithm, which builds a clustering feature (CF) tree incrementally. Because CURE represents each cluster by several sample points, it can handle non-spherical data and is widely applicable, but its running time is closely tied to the sampling ratio, choosing a suitable sampling ratio is itself a hard problem, and the algorithm requires the user to specify the number of clusters in advance, which is difficult for big data of unknown distribution. BIRCH saves memory and can work incrementally, so it scales well, but its result depends on the input order of the data, and the number of final clusters is limited by the control parameters of each node in the CF tree.
Summary of the invention
Based on the idea of quicksort, the present invention proposes a fast recursive clustering method, named CAQS (Clustering Algorithm based on Quicksort). CAQS is a typical recursive method. It treats the data set to be processed as a data sequence. First, a reference datum is selected at random from the sequence to serve as the representative datum of a cluster. Then, in a bidirectional scan of the data sequence, the similarity between each remaining datum and the reference datum is computed and compared with a user threshold, and the position of each remaining datum in the sequence is adjusted according to the result, following the rule that data whose similarity is greater than the user threshold are exchanged to the right side of the reference datum and data whose similarity is less than the user threshold are exchanged to its left side. With the reference datum as the boundary, each scan completes one partition of the data: the reference datum together with the data on its right is taken as one cluster, the data on its left are set as a new data sequence, and bidirectional scanning continues to complete the comparison and partition recursively.
The fast recursive clustering method for large-scale data provided by the invention proceeds by the following concrete steps (a code sketch of these steps is given after step 6):
Step 1: Input the user similarity threshold K and the data sequence D to be processed; a reference value for K is normally the minimum similarity between elements within a cluster;
Step 2: Define pointers i and j for the bidirectional scan of the pending data sequence, and point i and j to the leftmost and rightmost data of D, respectively;
Step 3: Randomly select a datum from D as the reference datum and exchange it with the leftmost datum in the sequence;
Step 4: Scan from right to left. Compute the similarity between the datum currently pointed to by j and the reference datum and compare it with K. If the similarity is less than the user threshold, exchange the data pointed to by i and j and move pointer i one step to the right, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 5. If the similarity is greater than or equal to the user threshold, move pointer j one step to the left and check the pointers: if i >= j, go to step 6, otherwise continue with step 4.
Step 5: Scan from left to right. Compute, using the chosen similarity measure, the similarity between the datum currently pointed to by i and the reference datum. If the similarity is greater than or equal to the user threshold, exchange the data pointed to by i and j and move pointer j one step to the left, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 4. If the similarity is less than the user threshold, move pointer i one step to the right and check the pointers: if i >= j, go to step 6, otherwise continue with step 5.
Step 6: Delimit the reference datum and the data on its right as a new cluster. Assign the data on the left of the reference datum as a new data sequence D and return to step 1, continuing the recursion until the number of data in the pending sequence D is less than or equal to 1.
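The following is a minimal, self-contained sketch of steps 1-6 in Java (the language used for the experiments described later). The class name CaqsSketch, the generic element type and the BiFunction-based similarity interface are illustrative assumptions; the patent specifies only the procedure, not an API. The recursion of step 6 is written as a loop over the shrinking left part, and a leftover single datum is kept as its own cluster, a detail the text leaves open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.BiFunction;

public class CaqsSketch<T> {

    private final double threshold;                      // user similarity threshold K
    private final BiFunction<T, T, Double> similarity;   // chosen similarity measure
    private final Random random = new Random();

    public CaqsSketch(double threshold, BiFunction<T, T, Double> similarity) {
        this.threshold = threshold;
        this.similarity = similarity;
    }

    /** Clusters the sequence in place; each returned sub-list is one cluster. */
    public List<List<T>> cluster(List<T> data) {
        List<List<T>> clusters = new ArrayList<>();
        int right = data.size() - 1;
        // Step 6, expressed iteratively: keep partitioning the left remainder
        // until at most one datum is pending.
        while (right >= 1) {
            int split = partition(data, 0, right);                  // one bidirectional scan
            clusters.add(new ArrayList<>(data.subList(split, right + 1)));
            right = split - 1;                                      // left side becomes the new sequence
        }
        if (right == 0) {                                           // leftover singleton
            clusters.add(new ArrayList<>(data.subList(0, 1)));
        }
        return clusters;
    }

    /**
     * One bidirectional scan (steps 2-5): data with similarity >= K to the
     * reference datum end up on its right, the rest on its left. Returns the
     * index of the reference datum after the scan.
     */
    private int partition(List<T> data, int left, int right) {
        Collections.swap(data, left, left + random.nextInt(right - left + 1)); // step 3
        T reference = data.get(left);
        int i = left, j = right;
        boolean rightToLeft = true;                                 // step 4 runs first
        while (i < j) {
            if (rightToLeft) {                                      // step 4
                if (similarity.apply(reference, data.get(j)) < threshold) {
                    Collections.swap(data, i, j);                   // dissimilar datum moves left
                    i++;
                    rightToLeft = false;                            // switch to step 5
                } else {
                    j--;
                }
            } else {                                                // step 5
                if (similarity.apply(reference, data.get(i)) >= threshold) {
                    Collections.swap(data, i, j);                   // similar datum moves right
                    j--;
                    rightToLeft = true;                             // switch back to step 4
                } else {
                    i++;
                }
            }
        }
        return i;                                                   // the reference datum ends up where i and j meet
    }
}
```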
Brief description of the drawings
Fig. 1 is the process flow diagram of the present invention:
1-(a) the initial data sequence and the related initial values;
1-(b) the first data exchange during the right-to-left scan, after which the scan direction is switched;
1-(c) the first data exchange during the left-to-right scan, after which the scan direction is switched;
1-(d) the second data exchange during the right-to-left scan, after which the scan direction is switched;
1-(e) the second data exchange during the left-to-right scan, after which the scan direction is switched;
1-(f) the third data exchange during the right-to-left scan, after which the scan direction is switched;
1-(g) the first recursive scan ends, producing the new cluster C_1 = {d1, d3, d5};
Fig. 2 compares the accuracy of the four algorithms; from left to right: CURE, BIRCH, K-means, CAQS (the present invention).
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and a specific embodiment.
In the following example, the data sequence is D = {d1, d2, d3, d4, d5, d6, d7, d8, d9}. It is known that there are four clusters, C = {C_1 = {d1, d3, d5}, C_2 = {d2, d6}, C_3 = {d4, d9}, C_4 = {d7, d8}}, that the similarity between data within a cluster is greater than or equal to 0.8, and that the similarity between data from different clusters is less than 0.8. To obtain the correct clustering result, the similarity threshold input in this run is set to 0.8. The quicksort-based recursive clustering method is applied to this data sequence as follows:
Step 1: Input K = 0.8 and the pending data sequence D;
Step 2: Define pointers i and j for the bidirectional scan of the pending data sequence, as shown in Fig. 1-(a), and point i and j to the leftmost and rightmost data of the sequence, respectively;
Step 3: Randomly select datum d1 as the reference datum, as shown in Fig. 1-(a); since it is already at the leftmost position, no exchange is needed;
Step 4: Scan from right to left and compute, using the chosen similarity measure, the similarity between the datum pointed to by j and the reference datum, comparing it with K. Because the similarity of (d1, d9) is below the threshold, the first data exchange takes place, as shown in Fig. 1-(b), and pointer i moves one step to the right; checking the pointers, i < j, so the scan direction is switched and step 5 is executed;
Step 5: Scan from left to right and compute, using the chosen similarity measure, the similarity between the datum pointed to by i and the reference datum. Because the similarity of (d1, d2) is below the threshold, pointer i moves one step to the right; checking the pointers, i < j, so step 5 continues. Because the similarity of (d1, d3) is greater than or equal to the threshold, the second data exchange takes place, as shown in Fig. 1-(c), and pointer j moves one step to the left; checking the pointers, i < j, so the scan direction is switched and step 4 is executed;
Step 4: Scan from right to left and compute the similarity between the datum pointed to by j and the reference datum, comparing it with K. Because the similarity of (d1, d8) is below the threshold, the third data exchange takes place, as shown in Fig. 1-(d), and pointer i moves one step to the right; checking the pointers, i < j, so the scan direction is switched and step 5 is executed;
Step 5: Scan from left to right and compute the similarity between the datum pointed to by i and the reference datum. Because the similarity of (d1, d5) is greater than or equal to the threshold, the fourth data exchange takes place, as shown in Fig. 1-(e), and pointer j moves one step to the left; checking the pointers, i < j, so the scan direction is switched and step 4 is executed;
Step 4: Scan from right to left and compute the similarity between the datum pointed to by j and the reference datum, comparing it with K. Because the similarity of (d1, d7) is below the threshold, the fifth data exchange takes place, as shown in Fig. 1-(f), and pointer i moves one step to the right; checking the pointers, i = j, so step 6 is executed;
Step 6: Delimit the reference datum and the data on its right as the new cluster C_1 = {d1, d3, d5}; assign the data on the left as a new data sequence D and execute steps 1-6 recursively until the number of remaining data is less than or equal to 1.
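As a hypothetical driver for this worked example, the CaqsSketch class given after the steps in the summary can be run with a lookup-based similarity that returns 0.9 for data in the same known cluster and 0.3 otherwise; the two values are illustrative stand-ins for the assumption that within-cluster similarity is at least 0.8 and between-cluster similarity is below 0.8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CaqsExample {
    public static void main(String[] args) {
        List<String> d = new ArrayList<>(List.of(
                "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"));
        // Known cluster memberships C_1..C_4 from the example above.
        Map<String, Integer> knownCluster = Map.of(
                "d1", 1, "d3", 1, "d5", 1,
                "d2", 2, "d6", 2,
                "d4", 3, "d9", 3,
                "d7", 4, "d8", 4);
        CaqsSketch<String> caqs = new CaqsSketch<>(0.8,
                (a, b) -> knownCluster.get(a).equals(knownCluster.get(b)) ? 0.9 : 0.3);
        // The reference datum of each scan is chosen at random, so the order of
        // the clusters and of the data inside each cluster varies between runs,
        // but the grouping always matches C_1..C_4.
        System.out.println(caqs.cluster(d));
    }
}
```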
To evaluate the accuracy and running time of the proposed fast clustering method, three classic clustering algorithms, K-means, CURE and BIRCH, were selected for comparison in a series of experiments; each reported result is the mean of 5 random runs.
Experimental conditions: Intel(R) Core(TM) 2 Quad CPU at 2.4 GHz with 2 GB of memory; the programs were implemented in Java, and the similarity measure used was the common Euclidean distance.
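The text does not state how the Euclidean distance is turned into a similarity that can be compared with a threshold such as K = 0.8; one common mapping, used here purely as an assumption, is sim(a, b) = 1 / (1 + dist(a, b)):

```java
import java.util.function.BiFunction;

public final class EuclideanSimilarity {

    /** Hypothetical adapter: Euclidean distance mapped into a similarity in (0, 1]. */
    public static final BiFunction<double[], double[], Double> SIM = (a, b) -> {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));   // distance 0 gives similarity 1
    };

    private EuclideanSimilarity() { }
}
```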
Experimental data: vector data sets of varying sizes generated randomly by computer. The random vectors simulate realistic spherical data with 10 clusters in total; each datum has 30 features, and each feature is an integer in the range [0, 20]. To make the 10 clusters distinguishable, following the principle that within-cluster similarity should be high and between-cluster similarity low, the 30 features are divided into 10 groups; the 3 features of each group can be regarded as the decisive or dominant features that distinguish the corresponding cluster from the others. The dominant features of each cluster take values in Random(10, 20), and the remaining features take values in Random(0, 10).
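A sketch of this data generator follows; the class name, method signature and the exact random calls are assumptions, since the description fixes only the ranges (dominant features of cluster c drawn from Random(10, 20), all other features from Random(0, 10)).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SyntheticData {

    /** Generates samplesPerCluster vectors for each of the 10 simulated clusters. */
    public static List<double[]> generate(int samplesPerCluster, Random rnd) {
        List<double[]> data = new ArrayList<>();
        for (int c = 0; c < 10; c++) {
            for (int s = 0; s < samplesPerCluster; s++) {
                double[] v = new double[30];
                for (int f = 0; f < 30; f++) {
                    boolean dominant = (f / 3 == c);          // features 3c..3c+2 belong to cluster c
                    v[f] = dominant ? 10 + rnd.nextInt(11)    // dominant feature: integer in [10, 20]
                                    : rnd.nextInt(10);        // other feature: integer in [0, 10)
                }
                data.add(v);
            }
        }
        return data;
    }
}
```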
Experimental results: Fig. 2 compares the accuracy of the four algorithms on a small data set of 10,000 samples, and Table 1 compares the time consumed by the four algorithms on data sets of different scales. The parameters of each algorithm were set as follows: for CURE, the random sample is 10% of the total number of samples and 5 representative points are used; the number of clusters K for K-means is set to 10; the three parameters of BIRCH, namely L (the number of minimal sub-clusters in a leaf node), T (the maximum radius of a sub-cluster in a leaf node) and B (the maximum number of children of each non-leaf node), are set to L = 30, T = 5 and B = 10, respectively; the similarity threshold of the method in the present invention is K = 0.8.
Because the cluster distribution of the artificial data is known, the Rand index commonly used in clustering evaluation is chosen as the accuracy measure. The experimental data are roughly spherical, so all four algorithms can process them. As can be seen from Fig. 2, the accuracy of all four algorithms on such a small data set is relatively high: even the worst, BIRCH, reaches 85%, and the method of the present invention is comparable to CURE, staying at about 94%.
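For reference, the Rand index counts the pairs on which the produced clustering and the known clustering agree. With a the number of pairs placed in the same cluster by both, b the number of pairs placed in different clusters by both, and n the number of samples:

$$ \mathrm{RI} = \frac{a + b}{\binom{n}{2}} $$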
As can be seen from Table 1, when the same similarity measure is used, the time efficiency of the present invention is far higher than that of the other methods. The algorithm is essentially based on the idea of quicksort: the time cost of each scan is at most N, one scan is needed for each cluster produced, and the number of clusters is in general far smaller than the number of samples, so the approximate time complexity can be regarded as O(N).
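Written as a bound (a sketch of the argument above, with N_i denoting the length of the sequence processed by the i-th scan, N_1 = N, and k the number of clusters produced):

$$ T(N) = \sum_{i=1}^{k} N_i \le k \cdot N , $$

which is effectively O(N) whenever k is much smaller than N.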
Table 1. Time comparison of the algorithms
Finally, it should be noted that the above example serves only to illustrate the present invention and does not limit the technical solution described herein. Therefore, although this specification has described the present invention in detail with reference to the above example, those of ordinary skill in the art should understand that the present invention may still be modified or equivalently substituted, and all technical solutions and improvements that do not depart from the spirit and scope of the invention shall fall within the scope of the claims of the present invention.

Claims (1)

1. A fast recursive clustering method suitable for large-scale data, characterized in that the steps are as follows:
Step 1: Input the user similarity threshold K and the data sequence D to be processed, where K is the minimum similarity between elements within the same cluster;
Step 2: Define pointers i and j for the bidirectional scan of the pending data sequence, and point i and j to the leftmost and rightmost data of D, respectively;
Step 3: Randomly select a datum from D as the reference datum and exchange it with the leftmost datum in the sequence;
Step 4: Scan from right to left. Compute, using the chosen similarity measure, the similarity between the datum currently pointed to by j and the reference datum and compare it with K. If the similarity is less than the user threshold, exchange the data pointed to by i and j and move pointer i one step to the right, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 5. If the similarity is greater than or equal to the user threshold, move pointer j one step to the left and check the pointers: if i >= j, go to step 6, otherwise continue with step 4;
Step 5: Scan from left to right. Compute, using the chosen similarity measure, the similarity between the datum currently pointed to by i and the reference datum. If the similarity is greater than or equal to the user threshold, exchange the data pointed to by i and j and move pointer j one step to the left, then check the pointers: if i >= j, go to step 6, otherwise switch the scan direction and go to step 4. If the similarity is less than the user threshold, move pointer i one step to the right and check the pointers: if i >= j, go to step 6, otherwise continue with step 5;
Step 6: Delimit the reference datum and the data on its right as a new cluster. Assign the data on the left of the reference datum as a new data sequence D and return to step 1, continuing the recursion until the number of data in the pending sequence D is less than or equal to 1.
CN201510206141.3A 2015-04-27 2015-04-27 Fast recursive clustering method suitable for large-scale data Pending CN104794215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510206141.3A CN104794215A (en) 2015-04-27 2015-04-27 Fast recursive clustering method suitable for large-scale data


Publications (1)

Publication Number Publication Date
CN104794215A true CN104794215A (en) 2015-07-22

Family

ID=53559007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510206141.3A Pending CN104794215A (en) 2015-04-27 2015-04-27 Fast recursive clustering method suitable for large-scale data

Country Status (1)

Country Link
CN (1) CN104794215A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288045A (en) * 2018-01-31 2018-07-17 天讯瑞达通信技术有限公司 A kind of mobile video live streaming/monitor video acquisition source tagsort method
CN108288045B (en) * 2018-01-31 2020-11-24 天讯瑞达通信技术有限公司 Mobile video live broadcast/monitoring video acquisition source feature classification method
CN109447186A (en) * 2018-12-13 2019-03-08 深圳云天励飞技术有限公司 Clustering method and Related product
CN112200206A (en) * 2019-07-08 2021-01-08 浙江宇视科技有限公司 BIRCH algorithm improvement method, device and equipment based on distributed platform
CN112200206B (en) * 2019-07-08 2024-02-27 浙江宇视科技有限公司 BIRCH algorithm improvement method, device and equipment based on distributed platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150722