CN107341210A - C-DBSCAN-K clustering algorithm under the Hadoop platform - Google Patents

C-DBSCAN-K clustering algorithm under the Hadoop platform

Info

Publication number
CN107341210A
CN107341210A
Authority
CN
China
Prior art keywords
cluster
data
read
dbscan
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710495491.5A
Other languages
Chinese (zh)
Other versions
CN107341210B (en)
Inventor
王彬
安涛
吕征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SUNMNET TECHNOLOGY Co.,Ltd.
Original Assignee
Xi'an University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an University of Technology
Priority to CN201710495491.5A
Publication of CN107341210A
Application granted
Publication of CN107341210B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

A C-DBSCAN-K clustering algorithm under the Hadoop platform comprises the following steps. Step 1, establish a cluster of computers that can communicate with each other. Step 2, establish the Hadoop platform on the cluster. Step 3, upload the data set A to be clustered to HDFS with the dfs -put command. Step 4, run the Canopy clustering algorithm to perform an initial clustering of the data in A and obtain a coarse-grained clustering result. Step 5, construct a k-d tree in each cluster obtained in step 4. Step 6, run the DBSCAN algorithm on the clusters obtained in step 4, using the k-d trees to query the ε-neighborhood of each data object, and output the clustering result. Step 7, merge the clusters from step 6 that contain identical data and output the clustering result. The algorithm of the invention solves the problem of the prior art that the DBSCAN clustering algorithm clusters large-scale data sets with low efficiency.

Description

C-DBSCAN-K clustering algorithm under the Hadoop platform
Technical field
The invention belongs to the technical field of computer data mining and relates to a C-DBSCAN-K clustering algorithm under the Hadoop platform.
Background art
Internet technology is developing rapidly nowadays, the Internet has penetrated deep into people's lives, and modern society has entered an information age in which huge amounts of data can be found everywhere. When facing massive data, the first task is to classify it reasonably, and cluster analysis is exactly such a process. Through clustering, people can intelligently and automatically identify valuable classification knowledge from a data set containing a large number of objects, obtain the distribution of the data, observe the differences between clusters, and on this basis carry out deeper analysis of specific groups. Cluster analysis techniques have been widely applied in fields such as business intelligence, image pattern recognition and Web search.
However, with the rapid development of the Internet era and the wide use of mobile devices, the amount of data grows exponentially, and traditional clustering algorithms that run on a single machine can no longer meet people's needs in terms of efficiency. The Hadoop distributed platform is a powerful tool for handling big data and provides the conditions for data mining. How to mine data with Hadoop, and how to redesign traditional single-machine algorithms as distributed algorithms with the MapReduce model so that the Hadoop distributed platform can process massive data efficiently, is therefore of great significance.
Summary of the invention
The purpose of the invention is to provide a C-DBSCAN-K clustering algorithm under the Hadoop platform that solves the problem of the prior art that the DBSCAN clustering algorithm clusters large-scale data sets with low efficiency.
The technical solution adopted by the invention is a C-DBSCAN-K clustering algorithm under the Hadoop platform, comprising the following steps:
Step 1, connect multiple computers to the same LAN, each computer serving as one node, and establish a cluster whose nodes can communicate with each other;
Step 2, establish the Hadoop platform on the cluster;
Step 3, upload the data set A to be clustered to the Hadoop distributed file system using the Hadoop distributed file system command dfs -put;
Step 4, run the Canopy clustering algorithm on the data in the data set A to be clustered to perform an initial clustering and obtain a coarse-grained clustering result;
Step 5, construct a k-d tree in each cluster obtained in step 4;
Step 6, run the DBSCAN algorithm on the clusters obtained in step 4; during the query process, use the k-d trees constructed in step 5 to query the ε-neighborhood of each data object in each cluster, and output the clustering result of the DBSCAN algorithm;
Step 7, merge the clusters from step 6 that contain identical data, and output the clustering result.
Step 2 is specifically:
First, install the redhat6.2 operating system on every node of the cluster; then install the Hadoop2.2.0 files on every node of the cluster, and install the jdk1.8.0_65 files on every node of the cluster; configure the .bashrc file of the redhat6.2 system on each node of the cluster so that the redhat6.2 system associates the Hadoop2.2.0 files on the node with the jdk1.8.0_65 files on the node; configure the hadoop-env.sh, mapred-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml files in the Hadoop2.2.0 files on each node.
The specific steps of step 4 are:
Step 4.1, determine the set of center points;
Step 4.2, cluster the data in the data set A to be clustered according to the set of center points.
The specific steps of step 4.1 are:
Step 4.1.1, start the first Map task, which scans and reads the data in the data set A to be clustered;
Step 4.1.2, initialize a center point set KEY1 as the empty set; for each datum read in, if KEY1 is empty, add the datum to KEY1; if KEY1 is not empty, compute with formula (1) the distance dist1 from the datum read in to the center points in KEY1:
dist1 = dist(d_i, d_j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|    (1)
where d_i is the i-th datum in the data set A to be clustered, d_i = (x_i1, x_i2, ..., x_ip), with x_i1, x_i2, ..., x_ip being the p numerical attributes of d_i; d_j is the j-th center point in KEY1, d_j = (x_j1, x_j2, ..., x_jp), with x_j1, x_j2, ..., x_jp being the p numerical attributes of d_j; and dist(d_i, d_j) denotes the Manhattan distance from d_i to d_j (a small worked example of formula (1) is sketched after step 4.1.4);
Step 4.1.3, if there exists a d_j for d_i such that dist(d_i, d_j) < T1, add d_i to KEY1, then update and output KEY1; here T1 is a preset initial distance threshold;
Step 4.1.4, start the first Reduce task, which reads the data in the KEY1 output by the first Map task; initialize a center point set KEY2 as the empty set; for each datum read in, if KEY2 is empty, add the datum to KEY2; if KEY2 is not empty, compute with formula (1) the distance dist2 from the datum read in to the center points in KEY2, and if there exists a center point such that dist2 < T1, add the datum to KEY2, then update and output KEY2.
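The following minimal Python sketch illustrates the computation of formula (1) on two small example objects; the function name manhattan and the sample values are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch of formula (1): the Manhattan distance between two
# p-dimensional data objects d_i and d_j (the sample values are made up).
def manhattan(d_i, d_j):
    """Sum of absolute differences of the p numerical attributes, as in formula (1)."""
    if len(d_i) != len(d_j):
        raise ValueError("both objects must have the same number of attributes p")
    return sum(abs(x_i - x_j) for x_i, x_j in zip(d_i, d_j))

# Two objects with p = 3 attributes: |1-4| + |2-0| + |3-3.5| = 5.5
print(manhattan((1.0, 2.0, 3.0), (4.0, 0.0, 3.5)))
```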
The specific steps of step 4.2 are:
Step 4.2.1, start the second Map task, which reads the data in the data set A to be clustered and the data in the KEY2 output by the first Reduce task;
Step 4.2.2, compute with formula (1) the distance dist3 from the data in the data set A to be clustered to the center points in KEY2;
Step 4.2.3, if there exists a center point in KEY2 such that dist3 < T2, form a set B from that center point and the data to be clustered whose distance to it is less than T2, and output the set B; here T2 is a preset initial distance threshold;
Step 4.2.4, start the second Reduce task, which reads the several sets B output by the second Map task, adds the data, other than the center point, of the sets B that have the same center point to the same cluster, and outputs the cluster (key, list), where key denotes a center point and list denotes all the data in the same cluster except key.
Step 5 is specifically: start the third Map task, which reads the clusters output by the second Reduce task, one cluster at a time, and constructs a k-d tree from the data of that cluster.
The specific steps of step 7 are:
Step 7.1, start the fourth Map task, which reads the clusters output by step 6;
Step 7.2, initialize a set C; read in one cluster at a time and add it to the set C; judge whether the cluster read in has identical data with a cluster in the set C; if so, take the cluster that has identical data with the cluster read in out of the set C, merge it with the cluster read in, remove the duplicate data from the merged cluster and then add it to the set C;
Step 7.3, after all the clusters read in by the fourth Map task have been processed according to step 7.2, output the set C;
Step 7.4, start the fourth Reduce task, which reads the clusters in the set C output by step 7.3; initialize a set D; read in one cluster at a time and add it to the set D; judge whether the cluster read in has identical data with a cluster in the set D; if so, take the cluster that has identical data with the cluster read in out of the set D, merge it with the cluster read in, remove the duplicate data from the merged cluster and then add it to the set D;
Step 7.5, after all the clusters read in by the fourth Reduce task have been processed according to step 7.4, output the set D as the clustering result.
The invention has the following advantages. The C-DBSCAN-K clustering algorithm under the Hadoop platform first uses the Canopy clustering algorithm to obtain a coarse-grained clustering result quickly; it then constructs a k-d tree data structure on the coarse-grained clustering result and runs the DBSCAN algorithm, querying the ε-neighborhood of each object with the k-d tree to speed up the operation of DBSCAN; finally, it merges the clusters that contain the same objects to obtain the final clustering result. The C-DBSCAN-K clustering algorithm under the Hadoop platform is fast and effective when processing large data sets and significantly improves the execution efficiency of the DBSCAN algorithm while keeping the clustering accuracy unchanged.
Brief description of the drawings
Fig. 1 is a flow chart of the C-DBSCAN-K clustering algorithm under the Hadoop platform;
Fig. 2 compares the clustering results of the C-DBSCAN-K clustering algorithm under the Hadoop platform and the DBSCAN algorithm;
Fig. 3 compares the execution times of the C-DBSCAN-K clustering algorithm under the Hadoop platform and the DBSCAN algorithm.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings and the specific embodiments.
As shown in Fig. 1, the C-DBSCAN-K clustering algorithm under the Hadoop platform comprises the following steps:
Step 1, connect multiple computers to the same LAN, each computer serving as one node, and establish a cluster whose nodes can communicate with each other;
Step 2, establish the Hadoop platform on the cluster;
Step 2 is specifically: first, install the redhat6.2 operating system on every node of the cluster; then install the Hadoop2.2.0 files on every node of the cluster, and install the jdk1.8.0_65 files on every node of the cluster; configure the .bashrc file of the redhat6.2 system on each node of the cluster so that the redhat6.2 system associates the Hadoop2.2.0 files on the node with the jdk1.8.0_65 files on the node; configure the hadoop-env.sh, mapred-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml files in the Hadoop2.2.0 files on each node.
Step 3, upload the data set A to be clustered to the Hadoop distributed file system using the Hadoop distributed file system command dfs -put;
Step 4, run the Canopy clustering algorithm on the data in the data set A to be clustered to perform an initial clustering and obtain a coarse-grained clustering result;
The specific steps of step 4 are:
Step 4.1, determine the set of center points;
The specific steps of step 4.1 are:
Step 4.1.1, start the first Map task, which scans and reads the data in the data set A to be clustered;
Step 4.1.2, initialize a center point set KEY1 as the empty set; for each datum read in, if KEY1 is empty, add the datum to KEY1; if KEY1 is not empty, compute with formula (1) the distance dist1 from the datum read in to the center points in KEY1:
dist1 = dist(d_i, d_j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|    (1)
where d_i is the i-th datum in the data set A to be clustered, d_i = (x_i1, x_i2, ..., x_ip), with x_i1, x_i2, ..., x_ip being the p numerical attributes of d_i; d_j is the j-th center point in KEY1, d_j = (x_j1, x_j2, ..., x_jp), with x_j1, x_j2, ..., x_jp being the p numerical attributes of d_j; and dist(d_i, d_j) denotes the Manhattan distance from d_i to d_j;
Step 4.1.3, if there exists a d_j for d_i such that dist(d_i, d_j) < T1, add d_i to KEY1, then update and output KEY1; here T1 is a preset initial distance threshold;
Step 4.1.4, start the first Reduce task, which reads the data in the KEY1 output by the first Map task; initialize a center point set KEY2 as the empty set; for each datum read in, if KEY2 is empty, add the datum to KEY2; if KEY2 is not empty, compute with formula (1) the distance dist2 from the datum read in to the center points in KEY2, and if there exists a center point such that dist2 < T1, add the datum to KEY2, then update and output KEY2.
Step 4.2, cluster the data in the data set A to be clustered according to the center point set KEY2.
The specific steps of step 4.2 are:
Step 4.2.1, start the second Map task, which reads the data in the data set A to be clustered and the data in the KEY2 output by the first Reduce task;
Step 4.2.2, compute with formula (1) the distance dist3 from the data in the data set A to be clustered to the center points in KEY2;
Step 4.2.3, if there exists a center point in KEY2 such that dist3 < T2, form a set B from that center point and the data to be clustered whose distance to it is less than T2, and output the set B; here T2 is a preset initial distance threshold;
Step 4.2.4, start the second Reduce task, which reads the several sets B output by the second Map task, adds the data, other than the center point, of the sets B that have the same center point to the same cluster, and outputs the cluster (key, list), where key denotes a center point and list denotes all the data in the same cluster except key. A single-machine sketch of this Canopy phase (steps 4.1 to 4.2) is given below.
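The following Python sketch condenses the Map/Reduce logic of steps 4.1 to 4.2 into two single-machine functions, following the center-selection rule and the thresholds T1 and T2 exactly as stated above; the function names, the example data and the threshold values are illustrative assumptions, and a real implementation would run this logic as the two Hadoop MapReduce jobs described in the text.

```python
# Single-machine sketch of the Canopy phase (steps 4.1-4.2); names are illustrative.
def manhattan(a, b):
    # Formula (1): sum of absolute attribute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def select_centers(points, t1):
    """Steps 4.1.1-4.1.4: build the center point set (the KEY1/KEY2 logic).

    Following the rule as stated in the text, the first point read is kept, and a
    later point becomes an additional center when some existing center lies within
    T1 of it.  In the MapReduce version each mapper builds a local KEY1 and the
    reducer repeats this filtering over the mappers' outputs to form KEY2.
    """
    centers = []
    for p in points:
        if not centers or any(manhattan(p, c) < t1 for c in centers):
            centers.append(p)
    return centers

def assign_to_canopies(points, centers, t2):
    """Steps 4.2.1-4.2.4: group each point with every center within T2 of it.

    For simplicity the center itself is kept in its member list; the text stores
    it separately as the key of the (key, list) output.
    """
    canopies = {c: [] for c in centers}
    for p in points:
        for c in centers:
            if manhattan(p, c) < t2:
                canopies[c].append(p)
    return canopies  # {center: list of member points}

# Tiny usage example with 2-D points and arbitrarily chosen thresholds.
data = [(0, 0), (1, 0), (2, 1), (3, 1)]
centers = select_centers(data, t1=2.0)
print(assign_to_canopies(data, centers, t2=4.0))
```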
Step 5, construct a k-d tree in each cluster obtained in step 4;
Step 5 is specifically: start the third Map task, which reads the clusters output by the second Reduce task, one cluster at a time, and constructs a k-d tree from the data of that cluster, as sketched below.
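As a minimal sketch of step 5, the snippet below builds one k-d tree per coarse cluster so that the ε-neighborhood queries of step 6 become range searches rather than linear scans; the use of scipy.spatial.cKDTree, the sample canopies and the query radius are illustrative assumptions and are not prescribed by the patent.

```python
# Minimal sketch of step 5: one k-d tree per coarse cluster (illustrative only).
import numpy as np
from scipy.spatial import cKDTree

# canopies: {center: list of member points}, as produced by the Canopy phase.
canopies = {
    (0.0, 0.0): [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5)],
    (10.0, 10.0): [(10.0, 10.0), (10.5, 10.0)],
}

# Build a k-d tree from the members of each cluster, keyed by the cluster's center.
trees = {center: cKDTree(np.asarray(members)) for center, members in canopies.items()}

# Epsilon-neighborhood query for one object of the first cluster (radius chosen arbitrarily).
neighbor_idx = trees[(0.0, 0.0)].query_ball_point((0.0, 0.0), r=0.8)
print(neighbor_idx)  # indices of the members lying within distance 0.8
```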
Step 6, run the DBSCAN algorithm on the clusters obtained in step 4; during the query process, use the k-d trees constructed in step 5 to query the ε-neighborhood of each data object in each cluster, and output the clustering result of the DBSCAN algorithm; a compact single-machine sketch of this step is given below.
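The following compact, single-machine sketch shows how step 6 can answer every ε-neighborhood query with the k-d tree of step 5 while running DBSCAN inside one coarse cluster; the parameter names eps and min_pts, the Euclidean metric used by cKDTree by default, and the example values are common DBSCAN conventions assumed here rather than values taken from the patent.

```python
# Compact sketch of step 6: DBSCAN inside one coarse cluster, with all
# epsilon-neighborhood queries answered by the cluster's k-d tree.
import numpy as np
from scipy.spatial import cKDTree

def dbscan_with_kdtree(points, eps, min_pts):
    """Return a label per point: 0, 1, ... for clusters, -1 for noise."""
    pts = np.asarray(points, dtype=float)
    tree = cKDTree(pts)                       # the k-d tree built in step 5
    labels = np.full(len(pts), -2)            # -2 = not yet visited
    cluster_id = -1
    for i in range(len(pts)):
        if labels[i] != -2:
            continue
        neighbors = tree.query_ball_point(pts[i], r=eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                    # provisionally noise
            continue
        cluster_id += 1                       # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id        # border point, reclaimed from noise
            if labels[j] != -2:
                continue
            labels[j] = cluster_id
            j_neighbors = tree.query_ball_point(pts[j], r=eps)
            if len(j_neighbors) >= min_pts:   # j is also a core point: expand further
                seeds.extend(j_neighbors)
    return labels

# Example on one coarse cluster's members.
members = [(0, 0), (0.3, 0), (0, 0.3), (5, 5), (5.2, 5), (9, 9)]
print(dbscan_with_kdtree(members, eps=0.5, min_pts=2))
```

In the patented method this routine would be applied to the members of each cluster produced by step 4, which is what bounds each ε-neighborhood search to a small k-d tree and yields the reported speed-up.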
Step 7, merge the clusters from step 6 that contain identical data, and output the clustering result.
The specific steps of step 7 are:
Step 7.1, start the fourth Map task, which reads the clusters output by step 6;
Step 7.2, initialize a set C; read in one cluster at a time and add it to the set C; judge whether the cluster read in has identical data with a cluster in the set C; if so, take the cluster that has identical data with the cluster read in out of the set C, merge it with the cluster read in, remove the duplicate data from the merged cluster and then add it to the set C;
Step 7.3, after all the clusters read in by the fourth Map task have been processed according to step 7.2, output the set C;
Step 7.4, start the fourth Reduce task, which reads the clusters in the set C output by step 7.3; initialize a set D; read in one cluster at a time and add it to the set D; judge whether the cluster read in has identical data with a cluster in the set D; if so, take the cluster that has identical data with the cluster read in out of the set D, merge it with the cluster read in, remove the duplicate data from the merged cluster and then add it to the set D;
Step 7.5, after all the clusters read in by the fourth Reduce task have been processed according to step 7.4, output the set D as the clustering result; a minimal sketch of this merging pass is given below.
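As a minimal single-machine sketch of the merging passes of steps 7.2 to 7.5, the function below repeatedly unions any clusters that share at least one data object until the surviving clusters are disjoint; modelling clusters as Python sets and the example data are illustrative assumptions, and in the patented method this logic runs once in the fourth Map task (set C) and once in the fourth Reduce task (set D).

```python
# Minimal sketch of step 7: merge clusters that share data objects (illustrative).
def merge_overlapping(clusters):
    """clusters: iterable of point collections -> list of disjoint sets of points."""
    merged = []                      # plays the role of set C (or set D)
    for cluster in clusters:
        cluster = set(cluster)       # working with sets removes duplicate data automatically
        # Pull out every already-kept cluster that shares at least one point.
        overlapping = [c for c in merged if c & cluster]
        for c in overlapping:
            merged.remove(c)
            cluster |= c             # union of the cluster read in and the overlapping one
        merged.append(cluster)
    return merged

# Example: the first and third clusters share the point (1, 1) and are merged.
parts = [[(0, 0), (1, 1)], [(5, 5)], [(1, 1), (2, 2)]]
print(merge_overlapping(parts))      # two disjoint clusters remain
```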
Four groups of simulated data sets generated with R language packages, namely the face, spirals, cassini and hypercube data sets, are taken as the data sets to be clustered, and the data in the four data sets are clustered with the C-DBSCAN-K clustering algorithm under the Hadoop platform and with the DBSCAN algorithm respectively. Fig. 2(a) and Fig. 2(b) show the results of the C-DBSCAN-K clustering algorithm under the Hadoop platform and of the DBSCAN algorithm on the face data set, Fig. 2(c) and Fig. 2(d) their results on the spirals data set, Fig. 2(e) and Fig. 2(f) their results on the cassini data set, and Fig. 2(g) and Fig. 2(h) their results on the hypercube data set. The rectangular boxes in Fig. 2 denote noise points, and the shapes in other grey levels denote different clusters. Fig. 2 shows that on every data set the C-DBSCAN-K clustering algorithm under the Hadoop platform generates the same clusters as the DBSCAN algorithm and identifies the same noise points, i.e. the two algorithms have the same accuracy.
As shown in Fig. 3, when the C-DBSCAN-K clustering algorithm under the Hadoop platform and the DBSCAN algorithm cluster the face data set containing 120k records, the spirals data set containing 150k records, the cassini data set containing 180k records and the hypercube data set containing 200k records, the running time of the C-DBSCAN-K clustering algorithm under the Hadoop platform is shorter than that of the DBSCAN algorithm, i.e. the C-DBSCAN-K clustering algorithm under the Hadoop platform has higher execution efficiency.
In summary, the C-DBSCAN-K clustering algorithm under the Hadoop platform of the invention first uses the Canopy clustering algorithm to obtain a coarse-grained clustering result quickly; it then constructs a k-d tree data structure on the coarse-grained clustering result and runs the DBSCAN algorithm, querying the ε-neighborhood of each object with the k-d tree to speed up the operation of DBSCAN; finally, it merges the clusters that contain the same objects to obtain the final clustering result. The C-DBSCAN-K clustering algorithm under the Hadoop platform is fast and effective when processing large data sets and significantly improves the execution efficiency of the DBSCAN algorithm while keeping the clustering accuracy unchanged.

Claims (7)

  1. A C-DBSCAN-K clustering algorithm under the Hadoop platform, characterised in that it comprises the following steps:
    Step 1, connect multiple computers to the same LAN, each computer serving as one node, and establish a cluster whose nodes can communicate with each other;
    Step 2, establish the Hadoop platform on the cluster;
    Step 3, upload the data set A to be clustered to the Hadoop distributed file system using the Hadoop distributed file system command dfs -put;
    Step 4, run the Canopy clustering algorithm on the data in the data set A to be clustered to perform an initial clustering and obtain a coarse-grained clustering result;
    Step 5, construct a k-d tree in each cluster obtained in step 4;
    Step 6, run the DBSCAN algorithm on the clusters obtained in step 4; during the query process, use the k-d trees constructed in step 5 to query the ε-neighborhood of each data object in each cluster, and output the clustering result of the DBSCAN algorithm;
    Step 7, merge the clusters from step 6 that contain identical data, and output the clustering result.
  2. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 1, characterised in that step 2 is specifically:
    first, install the redhat6.2 operating system on every node of the cluster; then install the Hadoop2.2.0 files on every node of the cluster, and install the jdk1.8.0_65 files on every node of the cluster; configure the .bashrc file of the redhat6.2 system on each node of the cluster so that the redhat6.2 system associates the Hadoop2.2.0 files on the node with the jdk1.8.0_65 files on the node; configure the hadoop-env.sh, mapred-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml files in the Hadoop2.2.0 files on each node.
  3. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 1, characterised in that the specific steps of step 4 are:
    Step 4.1, determine the set of center points;
    Step 4.2, cluster the data in the data set A to be clustered according to the set of center points.
  4. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 3, characterised in that the specific steps of step 4.1 are:
    Step 4.1.1, start the first Map task, which scans and reads the data in the data set A to be clustered;
    Step 4.1.2, initialize a center point set KEY1 as the empty set; for each datum read in, if KEY1 is empty, add the datum to KEY1; if KEY1 is not empty, compute with formula (1) the distance dist1 from the datum read in to the center points in KEY1:
    dist1 = dist(d_i, d_j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|    (1)
    where d_i is the i-th datum in the data set A to be clustered, d_i = (x_i1, x_i2, ..., x_ip), with x_i1, x_i2, ..., x_ip being the p numerical attributes of d_i; d_j is the j-th center point in KEY1, d_j = (x_j1, x_j2, ..., x_jp), with x_j1, x_j2, ..., x_jp being the p numerical attributes of d_j; and dist(d_i, d_j) denotes the Manhattan distance from d_i to d_j;
    Step 4.1.3, if there exists a d_j for d_i such that dist(d_i, d_j) < T1, add d_i to KEY1, then update and output KEY1; here T1 is a preset initial distance threshold;
    Step 4.1.4, start the first Reduce task, which reads the data in the KEY1 output by the first Map task; initialize a center point set KEY2 as the empty set; for each datum read in, if KEY2 is empty, add the datum to KEY2; if KEY2 is not empty, compute with formula (1) the distance dist2 from the datum read in to the center points in KEY2, and if there exists a center point such that dist2 < T1, add the datum to KEY2, then update and output KEY2.
  5. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 4, characterised in that the specific steps of step 4.2 are:
    Step 4.2.1, start the second Map task, which reads the data in the data set A to be clustered and the data in the KEY2 output by the first Reduce task;
    Step 4.2.2, compute with formula (1) the distance dist3 from the data in the data set A to be clustered to the center points in KEY2;
    Step 4.2.3, if there exists a center point in KEY2 such that dist3 < T2, form a set B from that center point and the data to be clustered whose distance to it is less than T2, and output the set B; here T2 is a preset initial distance threshold;
    Step 4.2.4, start the second Reduce task, which reads the several sets B output by the second Map task, adds the data, other than the center point, of the sets B that have the same center point to the same cluster, and outputs the cluster (key, list), where key denotes a center point and list denotes all the data in the same cluster except key.
  6. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 5, characterised in that step 5 is specifically:
    start the third Map task, which reads the clusters output by the second Reduce task, one cluster at a time, and constructs a k-d tree from the data of that cluster.
  7. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 6, characterised in that the specific steps of step 7 are:
    Step 7.1, start the fourth Map task, which reads the clusters output by step 6;
    Step 7.2, initialize a set C; read in one cluster at a time and add it to the set C; judge whether the cluster read in has identical data with a cluster in the set C; if so, take the cluster that has identical data with the cluster read in out of the set C, merge it with the cluster read in, remove the duplicate data from the merged cluster and then add it to the set C;
    Step 7.3, after all the clusters read in by the fourth Map task have been processed according to step 7.2, output the set C;
    Step 7.4, start the fourth Reduce task, which reads the clusters in the set C output by step 7.3; initialize a set D; read in one cluster at a time and add it to the set D; judge whether the cluster read in has identical data with a cluster in the set D; if so, take the cluster that has identical data with the cluster read in out of the set D, merge it with the cluster read in, remove the duplicate data from the merged cluster and then add it to the set D;
    Step 7.5, after all the clusters read in by the fourth Reduce task have been processed according to step 7.4, output the set D as the clustering result.
CN201710495491.5A 2017-06-26 2017-06-26 C-DBSCAN-K clustering algorithm under Hadoop platform Active CN107341210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710495491.5A CN107341210B (en) 2017-06-26 2017-06-26 C-DBSCAN-K clustering algorithm under Hadoop platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710495491.5A CN107341210B (en) 2017-06-26 2017-06-26 C-DBSCAN-K clustering algorithm under Hadoop platform

Publications (2)

Publication Number Publication Date
CN107341210A true CN107341210A (en) 2017-11-10
CN107341210B CN107341210B (en) 2020-07-31

Family

ID=60221100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710495491.5A Active CN107341210B (en) 2017-06-26 2017-06-26 C-DBSCAN-K clustering algorithm under Hadoop platform

Country Status (1)

Country Link
CN (1) CN107341210B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060013442A1 (en) * 2004-07-15 2006-01-19 Harris Corporation Bare earth digital elevation model extraction for three-dimensional registration from topographical points
US20140169673A1 (en) * 2011-07-29 2014-06-19 Ke-Yan Liu Incremental image clustering
US20150039619A1 (en) * 2012-03-19 2015-02-05 Microsoft Corporation Grouping documents and data objects via multi-center canopy clustering
US9286391B1 (en) * 2012-03-19 2016-03-15 Amazon Technologies, Inc. Clustering and recommending items based upon keyword analysis
CN103955685A (en) * 2014-04-22 2014-07-30 西安理工大学 Edge tracing digital recognition method
US20160004762A1 (en) * 2014-07-07 2016-01-07 Edward-Robert Tyercha Hilbert Curve Partitioning for Parallelization of DBSCAN
CN104517052A (en) * 2014-12-09 2015-04-15 中国科学院深圳先进技术研究院 Invasion detection method and device
CN104933156A (en) * 2015-06-25 2015-09-23 西安理工大学 Collaborative filtering method based on shared neighbor clustering
CN105550368A (en) * 2016-01-22 2016-05-04 浙江大学 Approximate nearest neighbor searching method and system of high dimensional data
CN106503086A (en) * 2016-10-11 2017-03-15 成都云麒麟软件有限公司 The detection method of distributed local outlier

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程堃 (Cheng Kun), "Research on parallelization of clustering algorithms based on a cloud platform" (基于云平台的聚类算法并行化研究), China Master's Theses Full-text Database (Information Science and Technology series) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491507A (en) * 2018-03-22 2018-09-04 北京交通大学 A kind of parallel continuous Query method of uncertain traffic flow data based on Hadoop distributed environments
CN109656696A (en) * 2018-12-03 2019-04-19 华南师范大学 A kind of processing method that data API is efficiently called
CN109656696B (en) * 2018-12-03 2020-10-16 华南师范大学 Processing method for efficient calling of data API
CN110334725A (en) * 2019-04-22 2019-10-15 国家电网有限公司 Thunderstorm clustering method, device, computer equipment and the storage medium of lightning data
CN110493221A (en) * 2019-08-19 2019-11-22 四川大学 A kind of network anomaly detection method based on the profile that clusters
CN110493221B (en) * 2019-08-19 2020-04-28 四川大学 Network anomaly detection method based on clustering contour
CN112579581A (en) * 2020-11-30 2021-03-30 贵州力创科技发展有限公司 Data access method and system of data analysis engine
CN112579581B (en) * 2020-11-30 2023-04-14 贵州力创科技发展有限公司 Data access method and system of data analysis engine

Also Published As

Publication number Publication date
CN107341210B (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN109564568B (en) Apparatus, method and machine-readable storage medium for distributed dataset indexing
CN107341210A C-DBSCAN-K clustering algorithm under the Hadoop platform
He et al. Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce
CN110134714B (en) Distributed computing framework cache index method suitable for big data iterative computation
Salinas et al. Data warehouse and big data integration
Hajeer et al. Handling big data using a data-aware HDFS and evolutionary clustering technique
US20210263949A1 (en) Computerized pipelines for transforming input data into data structures compatible with models
Yin et al. Parallel implementing improved k-means applied for image retrieval and anomaly detection
US11809460B1 (en) Systems, methods, and graphical user interfaces for taxonomy-based classification of unlabeled structured datasets
Graham et al. Finding and visualizing graph clusters using pagerank optimization
Sundarakumar et al. A heuristic approach to improve the data processing in big data using enhanced Salp Swarm algorithm (ESSA) and MK-means algorithm
Sergey et al. Applying map-reduce paradigm for parallel closed cube computation
CN103823881B (en) The method and device of the performance optimization of distributed data base
Qi et al. Clustering remote RDF data using SPARQL update queries
Wan et al. Dgs: Communication-efficient graph sampling for distributed gnn training
Arunachalam et al. A survey on web service clustering
Tang et al. Design of a data processing method for the farmland environmental monitoring based on improved Spark components
Papanikolaou Distributed algorithms for skyline computation using apache spark
CN113505600B (en) Distributed indexing method of industrial chain based on semantic concept space
CN111274243B (en) Information processing method and system based on multidimensional model form
Aljanabi et al. Large Dataset Classification Using Parallel Processing Concept
US11809915B1 (en) Parallel processing techniques for expediting reconciliation for a hierarchy of forecasts on a computer system
Yushui et al. K-means clustering algorithm for large-scale Chinese commodity information web based on Hadoop
Asnani A distributed k-mean clustering algorithm for cloud data mining
Tan et al. A Novel Association Rules Mining Based on Improved Fusion Particle Swarm Optimization Algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200703

Address after: 510000 Room 302, building 2, No. 10-16, taihegang Road, Yuexiu District, Guangzhou City, Guangdong Province

Applicant after: SUNMNET TECHNOLOGY Co.,Ltd.

Address before: 710048 Shaanxi city of Xi'an Province Jinhua Road No. 5

Applicant before: XI'AN University OF TECHNOLOGY

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: C-dbscan-k clustering algorithm based on Hadoop platform

Effective date of registration: 20210205

Granted publication date: 20200731

Pledgee: China Co. truction Bank Corp Guangzhou Liwan branch

Pledgor: SUNMNET TECHNOLOGY Co.,Ltd.

Registration number: Y2021980001059

PE01 Entry into force of the registration of the contract for pledge of patent right
PP01 Preservation of patent right

Effective date of registration: 20230919

Granted publication date: 20200731

PP01 Preservation of patent right