CN104850594A - Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data - Google Patents

Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Info

Publication number
CN104850594A
CN104850594A (application CN201510206140.9A)
Authority
CN
China
Prior art keywords
data
pointer
similarity
sequence
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510206140.9A
Other languages
Chinese (zh)
Inventor
冀俊忠
高明霞
宋辰
刘金铎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510206140.9A priority Critical patent/CN104850594A/en
Publication of CN104850594A publication Critical patent/CN104850594A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data, belonging to the technical field of data mining. The algorithm uses a two-layer loop to cluster the data. Two positioning pointers are defined in advance; one reference datum is randomly selected from the data sequence to serve as the representative of a cluster and is swapped to the rightmost position of the pending data, and at the same time a scanning pointer is defined and initialized. The pending data are then scanned, the similarity between each remaining datum and the reference datum is computed and compared with a user-supplied threshold, and the position of the datum in the sequence is adjusted according to the result of the comparison: data whose similarity exceeds the threshold are swapped to the left side of the sequence, and data whose similarity falls below the threshold are kept on the right side, which completes one partition of the data. Finally the positioning pointers are reset to locate the new pending data and control returns to the outer loop, which repeats until the whole data sequence has been clustered. The algorithm is suited to clustering spherical data and to large data sets with strict time requirements.

Description

A fast non-recursive clustering method suitable for large-scale data
Technical field
A fast clustering method suitable for large-scale data, belonging to the field of clustering research in data mining. It relates in particular to a clustering method suitable for applications with strict time requirements.
Background art
With the spread of mobile computing and the rise of the Internet of Things, massive amounts of data are being produced, especially multimedia data such as text, images, and video. As reported in "IDC Predictions 2014", the "digital universe" in 2014 — all the digital information created, replicated, and consumed in a year — was expected to keep expanding by more than 50%, reaching about 6 ZB (roughly 6 trillion gigabytes). Analyzing and mining such large data within a reasonable and acceptable time has become a major challenge for the IT field. Clustering, or cluster analysis, is frequently used for data preprocessing in data mining. It is a common form of exploratory data analysis and is widely applied in many practical fields, such as medicine (disease classification, genetic analysis), chemistry (grouping of compounds), social science (statistical classification), information retrieval (topic detection and tracking), and computer graphics (image segmentation). Data mining therefore urgently needs a clustering algorithm that can handle large-scale data and even big data, in order to serve applications that require fast processing, such as topic detection and tracking, spam filtering, and image segmentation.
Traditional clustering algorithms, such as hierarchical clustering and spectral clustering, usually need to build a similarity matrix by computing the similarity between every pair of data objects. Although such algorithms are highly accurate, their time complexity is typically at least O(N^2). For large-scale data (for example, 10^5 to 10^8 samples) or big data (more than 10^9 samples), such a time complexity is infeasible or at best hard to tolerate. Common clustering algorithms for large-scale data, whose time complexity is generally O(n) or O(n log n), include the CURE algorithm based on sampling and the BIRCH algorithm, which incrementally builds a clustering-feature (CF) tree. Because CURE represents each cluster with multiple sample points, it can handle non-spherical data and has a wide range of application; however, its running time is closely tied to the sampling ratio, choosing a suitable sampling ratio is itself a difficult problem, and the algorithm requires the user to specify the number of clusters in advance, which is also difficult for big data of unknown distribution. BIRCH saves memory and can work incrementally, so it scales well, but its clustering result depends on the input order of the data, and the number of final clusters is constrained by the control parameters of each node in the CF tree.
Summary of the invention
Based on the idea of quicksort, the present invention proposes a fast non-recursive clustering method, named NR-CAQS (Non-Recursive Clustering Algorithm based on Quicksort). The method first defines two positioning pointers, start and end, which locate the pending data. The pending data are regarded as a data sequence; one reference datum is randomly selected from the sequence to serve as the representative of a cluster and is swapped with the datum indicated by end. Two scanning pointers i and j are then defined: i marks the position of the growing cluster in the sequence, and j scans from left to right. They are initialized so that j points to the datum indicated by start and i points to the position one place to the left of j, indicating that the new cluster currently contains no elements. In the third step, the pending data are scanned, the similarity between each remaining datum and the reference datum is computed and compared with the user threshold, and the position of the datum in the sequence is adjusted according to the result: data whose similarity is greater than the threshold are swapped to the left side of the sequence, and data whose similarity is less than the threshold are left on the right side. One scan of the pending data completes one partition. Finally, the positioning pointers are reset to locate the new pending data and the above process repeats until the whole sequence has been clustered.
The fast non-recursive clustering method for large-scale data provided by the invention comprises the following concrete steps (a short code sketch follows the step list):
Step 1: input the user similarity threshold K and the initial pending data sequence D containing n data samples; K is normally the minimum similarity allowed between elements of the same cluster;
Step 2: define the pointers start and end indicating the head and the tail of the pending data sequence, and initialize start = 1 and end = n;
Step 3: randomly select one datum from the pending data sequence as the reference value and swap it with the datum indicated by end;
Step 4: define the scanning pointers i and j, and initialize j = start and i = j - 1;
Step 5: if j >= end, go to step 6; otherwise scan the sequence indicated by start and end from left to right, compute the similarity between the current datum indicated by j and the reference value according to the chosen similarity measure, and compare it with the user threshold. If the similarity is greater than or equal to the threshold, move i one step to the right, swap the data indicated by i and j, then move j one step to the right and repeat step 5; if the similarity is less than the threshold, move j one step to the right and repeat step 5;
Step 6: move i one step to the right and swap the data indicated by i and end;
Step 7: move i one step to the right; if i is less than end, reset the start pointer so that it points to the datum indicated by i and return to step 3 to continue the loop; otherwise the algorithm terminates.
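For illustration only, the following is a minimal Java sketch of steps 1 to 7; it is not the authoritative implementation of the invention. The names NrCaqs, cluster, and Similarity, the use of 0-based array indices (the steps above use 1-based positions), and the pluggable similarity interface are assumptions introduced here for readability.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NrCaqs {
    /** Similarity between two data items; the measure itself is left pluggable. */
    public interface Similarity<T> {
        double of(T a, T b);
    }

    /**
     * Partition-based clustering following steps 1-7. Each outer pass peels one
     * cluster off the left end of the pending range [start, end].
     */
    public static <T> List<List<T>> cluster(T[] d, double k, Similarity<T> sim) {
        List<List<T>> clusters = new ArrayList<>();
        Random rnd = new Random();
        int start = 0, end = d.length - 1;          // step 2 (0-based here)
        while (start <= end) {
            int pivot = start + rnd.nextInt(end - start + 1);
            swap(d, pivot, end);                    // step 3: reference value moved to end
            int i = start - 1;                      // step 4: the new cluster is empty so far
            for (int j = start; j < end; j++) {     // step 5: single left-to-right scan
                if (sim.of(d[j], d[end]) >= k) {
                    i++;
                    swap(d, i, j);                  // similar item joins the left block
                }
            }
            i++;
            swap(d, i, end);                        // step 6: reference value closes the cluster
            List<T> c = new ArrayList<>();
            for (int p = start; p <= i; p++) {
                c.add(d[p]);
            }
            clusters.add(c);
            start = i + 1;                          // step 7: remaining data become pending
        }
        return clusters;
    }

    private static <T> void swap(T[] d, int a, int b) {
        T t = d[a]; d[a] = d[b]; d[b] = t;
    }
}

Each pass of the outer while loop corresponds to one execution of steps 3 to 7 and peels exactly one cluster off the left of the pending range, which is why the method needs no recursion, unlike textbook quicksort.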
Brief description of the drawings
Fig. 1 illustrates the flow of the method of the present invention:
1-(a) define the head and tail pointers of the pending data and initialize them;
1-(b) select the reference value and define and initialize the scanning pointers;
1-(c) scan from left to right without exchanging data;
1-(d) the first data exchange during the left-to-right scan;
1-(e) the second data exchange during the left-to-right scan;
1-(f) exchange the reference value into its final position;
1-(g) redefine the pending data sequence by resetting the start value.
Fig. 2 compares the accuracy of the four algorithms on a small data set; from left to right the algorithms are CURE, BIRCH, K-means, and NR-CAQS (the present invention).
Embodiment
The present invention is described in detail below with reference to the drawings and a specific embodiment.
In the following example the data sequence is D = {d1, d2, d3, d4, d5, d6, d7, d8, d9}. It is known that there are four clusters, C = {C_1 = {d1, d3, d5}, C_2 = {d2, d6}, C_3 = {d4, d9}, C_4 = {d7, d8}}, that the similarity between data within a cluster is at least 0.8, and that the similarity between data from different clusters is below 0.8. To obtain the correct clustering result, the similarity threshold entered in this run is set to 0.8. The non-recursive clustering method based on quicksort is applied to this data sequence as follows:
Step 1: input the user similarity threshold K = 0.8 and the initial pending data sequence D containing 9 data samples;
Step 2: define the pointers start and end indicating the head and tail of the pending data sequence and initialize start = 1, end = 9, as in Fig. 1-(a);
Step 3: randomly select one datum (here d1) from the pending data sequence as the reference value and swap it with the datum indicated by end, as in Fig. 1-(b);
Step 4: define the scanning pointers i and j and initialize j = start, with i pointing one position to the left of j, as in Fig. 1-(b);
Step 5: as in Fig. 1-(c), because j < end, the sequence indicated by start and end is scanned from left to right; the similarity between the current datum indicated by j and the reference value is computed with the chosen similarity measure and compared with the user threshold. Since the similarity of (d1, d9) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity of the current datum and the reference value is computed and compared with the threshold; since the similarity of (d1, d2) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: as in Fig. 1-(d), because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d3) is greater than or equal to the threshold, i moves one step to the right, the data indicated by i and j are swapped, then j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d4) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: as in Fig. 1-(e), because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d5) is greater than or equal to the threshold, i moves one step to the right, the data indicated by i and j are swapped, then j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d6) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d7) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j < end, the similarity is computed and compared with the threshold; since the similarity of (d1, d8) is below the threshold, j moves one step to the right and step 5 is repeated;
Step 5: because j>=end, then perform step 6;
Step 6: as Fig. 1-(f), i move right 1, exchanges the data that pointer i and end specifies;
Step 7: as Fig. 1-(g), i move right 1 because i be less than end then reset start make it point to i and return step 3 continue circulation.
So far, by the first bunch of C_1={d1, d3, d5} have split, and have reassigned pending data sequence end to end.
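To connect the embodiment above with the sketch given after step 7, the following hypothetical driver applies it to the nine items d1 to d9. The embodiment only states that within-cluster similarity is at least 0.8 and between-cluster similarity is below 0.8, so the concrete values 0.9 and 0.5 used below are assumptions chosen merely to satisfy those constraints; the class name NrCaqsExample is likewise illustrative, and the NrCaqs sketch above is assumed to be on the classpath.

import java.util.List;
import java.util.Map;

public class NrCaqsExample {
    public static void main(String[] args) {
        String[] d = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"};
        // Cluster labels taken from the embodiment:
        // C_1={d1,d3,d5}, C_2={d2,d6}, C_3={d4,d9}, C_4={d7,d8}.
        Map<String, Integer> label = Map.of(
                "d1", 1, "d3", 1, "d5", 1,
                "d2", 2, "d6", 2,
                "d4", 3, "d9", 3,
                "d7", 4, "d8", 4);
        // Assumed similarity: 0.9 within a cluster, 0.5 across clusters
        // (only the comparison against the threshold 0.8 matters).
        NrCaqs.Similarity<String> sim =
                (a, b) -> label.get(a).equals(label.get(b)) ? 0.9 : 0.5;
        List<List<String>> clusters = NrCaqs.cluster(d, 0.8, sim);
        System.out.println(clusters);  // e.g. [[d3, d5, d1], [d6, d2], [d9, d4], [d8, d7]], up to order
    }
}

Because the reference datum of each pass is chosen at random, the order of the clusters and of the elements inside each cluster varies from run to run; the grouping itself does not.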
To evaluate the time performance of the proposed fast clustering method, three classic clustering algorithms, K-means, CURE, and BIRCH, were selected as comparison algorithms and a series of experiments was carried out. Every reported result is the mean of 5 random runs.
Experimental environment: Intel(R) Core(TM)2 Quad CPU 2.4 GHz, 2 GB memory; the programs were implemented in Java, and the similarity computation is based on the common Euclidean distance.
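The text states only that Euclidean distance underlies the similarity computation; it does not say how a distance is converted into a similarity comparable with the threshold K. The sketch below therefore plugs a Euclidean measure into the Similarity interface assumed earlier and uses the common, here merely assumed, transform sim = 1 / (1 + dist).

public final class EuclideanSimilarity implements NrCaqs.Similarity<double[]> {
    @Override
    public double of(double[] a, double[] b) {
        double sum = 0.0;
        for (int f = 0; f < a.length; f++) {        // 30 features per datum in the experiments
            double diff = a[f] - b[f];
            sum += diff * diff;
        }
        double dist = Math.sqrt(sum);
        return 1.0 / (1.0 + dist);                  // assumed distance-to-similarity transform
    }
}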
Experimental data: vector data sets of varying sizes were generated randomly by computer; the random vectors simulate real spherical data in 10 clusters, each datum has 30 features, and each feature is an integer in [0, 20]. To make the 10 clusters distinguishable, with high similarity within a cluster and low similarity between clusters, the 30 features are divided into 10 groups, and the 3 features of one group are treated as the decisive or dominant features of the corresponding cluster, distinguishing it from the other clusters. The dominant features of each cluster take values in Random(10, 20), and the remaining features take values in Random(0, 10).
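A sketch of the data-generation scheme just described, under two stated assumptions: the samples are assigned to the 10 clusters round-robin (the text does not say how many samples each cluster receives), and Random(a, b) is read as the half-open integer range [a, b) in the style of java.util.Random (the text does not state whether the bounds are inclusive). The class name SyntheticSpherical is illustrative.

import java.util.Random;

public class SyntheticSpherical {
    /** Generate n data, assigned round-robin to 10 clusters, 30 integer-valued features each. */
    public static double[][] generate(int n, long seed) {
        Random rnd = new Random(seed);
        double[][] data = new double[n][30];
        for (int s = 0; s < n; s++) {
            int cluster = s % 10;                       // 10 clusters in total
            for (int f = 0; f < 30; f++) {
                boolean dominant = (f / 3) == cluster;  // feature group f/3 is decisive for its cluster
                data[s][f] = dominant
                        ? 10 + rnd.nextInt(10)          // dominant features: Random(10, 20)
                        : rnd.nextInt(10);              // remaining features: Random(0, 10)
            }
        }
        return data;
    }
}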
Experimental results: Fig. 2 compares the accuracy of the four algorithms on a small set of 10,000 samples, and Table 1 compares the time taken by the four algorithms on data sets of different sizes. The parameter settings of each algorithm in the experiments are as follows: for CURE, the random sample is 10% of the total sample count and 5 representative points are selected; the number of clusters K for K-means is set to 10; the three parameters of BIRCH, namely L (the number of minimal sub-clusters in a leaf node), T (the maximum radius of a sub-cluster in a leaf node), and B (the maximum number of children of each non-leaf node), are set to L = 30, T = 5, B = 10 respectively; the similarity threshold of the algorithm of the present invention is K = 0.8.
Because the cluster assignment of the artificial data is known, the Rand index, commonly used in cluster evaluation, is chosen to measure accuracy. The experimental data are approximately spherical, so all four algorithms can handle them. As can be seen from Fig. 2, on such a small data set the accuracy of all four algorithms is relatively high; even the worst, BIRCH, reaches 85%, while the method of the present invention is comparable to CURE in accuracy, staying around 94%.
As can be seen from Table 1, when the same similarity measure is used, the time efficiency of the present invention is far higher than that of the other methods. The algorithm is essentially based on the idea of quicksort: each scan costs at most N similarity computations, one scan is needed per cluster, and the number of clusters is normally far smaller than the number of samples, so the approximate time complexity can be regarded as O(N).
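Written out, with k clusters, N samples, and N_c the number of pending data at the beginning of pass c (so N_1 = N and the N_c shrink as clusters are peeled off), the argument above amounts to:

T(N) = \sum_{c=1}^{k} (N_c - 1) \;\le\; k \cdot N, \qquad \text{hence } T(N) = O(kN) \approx O(N) \text{ for } k \ll N.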
Table 1. Time comparison of the algorithms on data sets of different scales
Finally, it should be noted that the above example is intended only to illustrate the present invention and not to limit the technical solution described herein. Therefore, although this specification has described the present invention in detail with reference to the above example, those of ordinary skill in the art should understand that the present invention may still be modified or equivalently replaced, and all technical solutions and improvements that do not depart from the spirit and scope of the invention shall be covered by the claims of the present invention.

Claims (1)

1. A fast non-recursive clustering method suitable for large-scale data, characterized in that the steps are as follows:
Step 1: input the user similarity threshold K and the initial pending data sequence D containing n data samples; K is normally the minimum similarity allowed between elements of the same cluster;
Step 2: define the pointers start and end indicating the head and the tail of the pending data sequence, and initialize start = 1 and end = n;
Step 3: randomly select one datum from the pending data sequence as the reference value and swap it with the datum indicated by end;
Step 4: define the scanning pointers i and j, and initialize j = start and i = j - 1;
Step 5: if j >= end, go to step 6; otherwise scan the sequence indicated by start and end from left to right, compute the similarity between the current datum indicated by j and the reference value according to the chosen similarity measure, and compare it with the user threshold; if the similarity is greater than or equal to the threshold, move i one step to the right, swap the data indicated by i and j, then move j one step to the right and repeat step 5; if the similarity is less than the threshold, move j one step to the right and repeat step 5;
Step 6: move i one step to the right and swap the data indicated by i and end;
Step 7: move i one step to the right; if i is less than end, reset the start pointer so that it points to the datum indicated by i and return to step 3 to continue the loop; otherwise the algorithm terminates.
CN201510206140.9A 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data Pending CN104850594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510206140.9A CN104850594A (en) 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510206140.9A CN104850594A (en) 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Publications (1)

Publication Number Publication Date
CN104850594A true CN104850594A (en) 2015-08-19

Family

ID=53850239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510206140.9A Pending CN104850594A (en) 2015-04-27 2015-04-27 Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

Country Status (1)

Country Link
CN (1) CN104850594A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183567A (en) * 2019-07-05 2021-01-05 浙江宇视科技有限公司 Optimization method, device, equipment and storage medium of BIRCH algorithm
CN112183567B (en) * 2019-07-05 2024-02-06 浙江宇视科技有限公司 BIRCH algorithm optimization method, device, equipment and storage medium
CN110909824A (en) * 2019-12-09 2020-03-24 天津开心生活科技有限公司 Test data checking method and device, storage medium and electronic equipment
CN110909824B (en) * 2019-12-09 2022-10-28 天津开心生活科技有限公司 Test data checking method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN106777038B (en) A kind of ultralow complexity image search method retaining Hash based on sequence
US9330341B2 (en) Image index generation based on similarities of image features
KR102305568B1 (en) Finding k extreme values in constant processing time
WO2016062044A1 (en) Model parameter training method, device and system
CN111125469B (en) User clustering method and device of social network and computer equipment
TWI464604B (en) Data clustering method and device, data processing apparatus and image processing apparatus
CN110188225A (en) A kind of image search method based on sequence study and polynary loss
CN108427745A (en) The image search method of visual dictionary and adaptive soft distribution based on optimization
CN107832456A (en) A kind of parallel KNN file classification methods based on the division of critical Value Data
JP2020135892A (en) Error correction method, apparatus, and computer-readable medium
CN109492682A (en) A kind of multi-branched random forest data classification method
Golge et al. Conceptmap: Mining noisy web data for concept learning
Valem et al. Unsupervised similarity learning through Cartesian product of ranking references
Nayini et al. A novel threshold-based clustering method to solve K-means weaknesses
CN104794215A (en) Fast recursive clustering method suitable for large-scale data
CN104850594A (en) Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data
CN109919057B (en) Multi-mode fusion gesture recognition method based on efficient convolutional neural network
CN110209895B (en) Vector retrieval method, device and equipment
WO2015109781A1 (en) Method and device for determining parameter of statistical model on the basis of expectation maximization
Liu et al. Learning distilled graph for large-scale social network data clustering
CN109885685A (en) Method, apparatus, equipment and the storage medium of information data processing
CN105718950B (en) A kind of semi-supervised multi-angle of view clustering method based on structural constraint
CN108090514B (en) Infrared image identification method based on two-stage density clustering
Aleb et al. An improved K-means algorithm for DNA sequence clustering
CN109977787A A kind of Human bodys' response method of multi-angle of view

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150819