CN107341210A - C-DBSCAN-K clustering algorithm under Hadoop platform
- Publication number: CN107341210A
- Authority: CN (China)
- Prior art keywords: cluster, data, read, dbscan, files
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Claims (7)
- 1. A C-DBSCAN-K clustering algorithm under the Hadoop platform, characterized in that it comprises the following steps:
  - Step 1, connect multiple computers to the same LAN, with each computer acting as a node, to establish a cluster whose nodes can communicate with one another;
  - Step 2, set up the Hadoop platform on the cluster;
  - Step 3, upload the data set A to be clustered to the Hadoop distributed file system using the Hadoop distributed file command dfs -put;
  - Step 4, run the Canopy clustering algorithm to perform an initial clustering of the data in the data set A to be clustered, obtaining coarse-grained clusters;
  - Step 5, construct a k-d tree in each cluster obtained in Step 4;
  - Step 6, run the DBSCAN algorithm on the clusters obtained in Step 4, using the k-d tree constructed in Step 5 during the query process to look up the ε-neighborhood of the data objects in each cluster, and output the clustering result of the DBSCAN algorithm;
  - Step 7, merge the clusters from Step 6 that contain identical data, and output the clustering result.
- 2. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 1, characterized in that Step 2 is specifically: first, install the redhat6.2 operating system on each node in the cluster; then install the Hadoop2.2.0 files and the jdk1.8.0_65 files on each node in the cluster; configure the .bashrc file of the redhat6.2 system on each node so that the redhat6.2 system is associated with the Hadoop2.2.0 files and the jdk1.8.0_65 files on that node; and, on each node, configure the following files within the Hadoop2.2.0 files: hadoop-env.sh, mapred-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.
- 3. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 1, characterized in that Step 4 specifically comprises:
  - Step 4.1, determine the center point set;
  - Step 4.2, cluster the data in the data set A to be clustered according to the center point set.
- 4. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 3, characterized in that Step 4.1 specifically comprises:
  - Step 4.1.1, start the first Map task, then scan and read in the data in the data set A to be clustered;
  - Step 4.1.2, initialize a center point set KEY1 as empty; for each datum read in, if KEY1 is empty, add the datum to KEY1; if KEY1 is not empty, use formula (1) to compute the distance dist1 from the datum read in to the center points in KEY1:

    $$\mathrm{dist1} = \mathrm{dist}(d_i, d_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}| \qquad (1)$$

    where $d_i$ is the i-th datum in the data set A to be clustered, $d_i = (x_{i1}, x_{i2}, \dots, x_{ip})$, and $x_{i1}, x_{i2}, \dots, x_{ip}$ are the p numerical attributes of $d_i$; $d_j$ is the j-th center point in KEY1, $d_j = (x_{j1}, x_{j2}, \dots, x_{jp})$, and $x_{j1}, x_{j2}, \dots, x_{jp}$ are the p numerical attributes of $d_j$; $\mathrm{dist}(d_i, d_j)$ denotes the Manhattan distance from $d_i$ to $d_j$;
  - Step 4.1.3, if, for $d_i$, there exists a $d_j$ such that $\mathrm{dist}(d_i, d_j) < T1$, add $d_i$ to KEY1, then update and output KEY1; here T1 is the preset initial distance threshold;
  - Step 4.1.4, start the first Reduce task and read in the data in the KEY1 output by the first Map task; initialize a center point set KEY2 as empty; for each datum read in, if KEY2 is empty, add the datum to KEY2; if KEY2 is not empty, use formula (1) to compute the distance dist2 from the datum read in to the center points in KEY2, and if there exists a center point such that dist2 < T1, add this datum to KEY2, then update and output KEY2.
- 5. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 4, characterized in that Step 4.2 specifically comprises:
  - Step 4.2.1, start the second Map task, and read in the data in the data set A to be clustered and the data in the KEY2 output by the first Reduce task;
  - Step 4.2.2, use formula (1) to compute the distance dist3 from the data in the data set A to be clustered to the center points in KEY2;
  - Step 4.2.3, if there exists a center point in KEY2 such that dist3 < T2, that center point and the data to be clustered whose distance to it is less than T2 form a set B; output set B; here T2 is a preset initial distance threshold;
  - Step 4.2.4, start the second Reduce task and read in the sets B output by the second Map task; add the data in the sets B that share the same center point, excluding the center point itself, to the same cluster, and output the cluster as (key, list), where key denotes a center point and list denotes all data in the same cluster other than key.
- 6. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 5, characterized in that Step 5 is specifically: start the third Map task and read in the clusters output by the second Reduce task, one cluster at a time, constructing a k-d tree over the data of each cluster read in.
- 7. The C-DBSCAN-K clustering algorithm under the Hadoop platform according to claim 6, characterized in that Step 7 specifically comprises:
  - Step 7.1, start the fourth Map task and read in the clusters output by Step 6;
  - Step 7.2, initialize a set C; read in one cluster at a time and add it to set C; determine whether the cluster read in shares identical data with any cluster already in set C; if so, take the clusters that share data with the cluster read in out of set C, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set C;
  - Step 7.3, after all clusters read into the fourth Map task have been processed by Step 7.2, output set C;
  - Step 7.4, start the fourth Reduce task and read in the clusters in the set C output by Step 7.3; initialize a set D; read in one cluster at a time and add it to set D; determine whether each cluster read in shares identical data with any cluster already in set D; if so, take the clusters that share data with the cluster read in out of set D, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set D;
  - Step 7.5, after all clusters read into the fourth Reduce task have been processed by Step 7.4, output set D as the clustering result.
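The assignment stage of claim 5 (Steps 4.2.1 to 4.2.4) can be illustrated with a small single-process Python sketch. It is not the patented Hadoop implementation: the function names `manhattan`, `map_assign` and `reduce_group`, the sample points and the threshold value are all hypothetical, and the Map and Reduce tasks are simulated with an ordinary generator and a dictionary.

```python
from collections import defaultdict


def manhattan(a, b):
    """Formula (1): sum of absolute differences over the p numerical attributes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def map_assign(points, centers, t2):
    """Map side (Steps 4.2.1-4.2.3): emit (center, point) whenever dist3 < T2."""
    for p in points:
        for c in centers:
            if manhattan(p, c) < t2:
                yield c, p


def reduce_group(pairs):
    """Reduce side (Step 4.2.4): group points by center, leaving out the center itself."""
    clusters = defaultdict(list)
    for center, point in pairs:
        if point != center:
            clusters[center].append(point)
    return dict(clusters)  # center -> list, i.e. the (key, list) output


# Hypothetical data: two centers standing in for KEY2, threshold T2 = 3.
points = [(0, 0), (1, 0), (0, 2), (10, 10), (11, 10)]
centers = [(0, 0), (10, 10)]
clusters = reduce_group(map_assign(points, centers, t2=3))
```

Because a point is emitted once for every center within T2, the same point may land in more than one coarse cluster, which is why the later merge step (Step 7) is needed.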
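Steps 5 and 6 rely on a k-d tree to answer ε-neighborhood queries inside each coarse cluster. The claims do not say how the tree is built or traversed, so the following is only one common construction (median split, axes used in rotation), written in plain Python with hypothetical names `build_kdtree` and `eps_neighborhood`. It assumes the Manhattan distance of formula (1) is also used for the neighborhood test, which the claims do not state explicitly.

```python
def manhattan(a, b):
    """Manhattan distance, as in formula (1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Build a k-d tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),       # coordinates <= split
        "right": build_kdtree(pts[mid + 1:], depth + 1),  # coordinates >= split
    }


def eps_neighborhood(node, query, eps, found=None):
    """Collect every stored point whose distance to `query` is at most eps."""
    if found is None:
        found = []
    if node is None:
        return found
    if manhattan(node["point"], query) <= eps:
        found.append(node["point"])
    # The per-axis gap is a lower bound on the Manhattan distance, so a
    # subtree lying entirely more than eps away along this axis is skipped.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff <= eps:
        eps_neighborhood(node["left"], query, eps, found)
    if diff >= -eps:
        eps_neighborhood(node["right"], query, eps, found)
    return found


# Hypothetical cluster contents and query.
cluster = [(0, 0), (1, 0), (0, 2), (5, 5), (6, 5), (1, 1)]
tree = build_kdtree(cluster)
neighbors = sorted(eps_neighborhood(tree, (0, 0), eps=2))
```

A DBSCAN pass would call `eps_neighborhood` once per data object and compare the size of the result with its minimum-points parameter; that parameter is not given in the claims, so the sketch stops at the query.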
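The merge rule of Step 7.2 (repeated in Step 7.4 with set D) can be sketched in a few lines of Python. The function name `merge_overlapping` and the sample clusters are hypothetical, and the sketch runs in a single process rather than as Map and Reduce tasks. Clusters are held as Python sets, so taking the union also removes the duplicate data.

```python
def merge_overlapping(clusters):
    """Fold clusters one at a time into a running collection (set C or set D),
    merging the incoming cluster with every stored cluster it shares data with."""
    merged = []
    for cluster in clusters:
        current = set(cluster)
        remaining = []
        for existing in merged:
            if existing & current:
                current |= existing  # union drops duplicate data
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged


# Hypothetical DBSCAN output: data are represented by integer ids.
clusters = [[1, 2, 3], [7, 8], [3, 4], [8, 9], [10]]
result = sorted(sorted(c) for c in merge_overlapping(clusters))
```

The stored clusters stay pairwise disjoint after every step, so one pass over the stored collection per incoming cluster is enough, even when the incoming cluster bridges several stored ones.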
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710495491.5A CN107341210B (en) | 2017-06-26 | 2017-06-26 | C-DBSCAN-K clustering algorithm under Hadoop platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710495491.5A CN107341210B (en) | 2017-06-26 | 2017-06-26 | C-DBSCAN-K clustering algorithm under Hadoop platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107341210A (en) | 2017-11-10 |
CN107341210B (en) | 2020-07-31 |
Family
ID=60221100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710495491.5A Active CN107341210B (en) | C-DBSCAN-K clustering algorithm under Hadoop platform | 2017-06-26 | 2017-06-26 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107341210B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060013442A1 (en) * | 2004-07-15 | 2006-01-19 | Harris Corporation | Bare earth digital elevation model extraction for three-dimensional registration from topographical points |
US20140169673A1 (en) * | 2011-07-29 | 2014-06-19 | Ke-Yan Liu | Incremental image clustering |
CN103955685A (en) * | 2014-04-22 | 2014-07-30 | 西安理工大学 | Edge tracing digital recognition method |
US20150039619A1 (en) * | 2012-03-19 | 2015-02-05 | Microsoft Corporation | Grouping documents and data objects via multi-center canopy clustering |
CN104517052A (en) * | 2014-12-09 | 2015-04-15 | 中国科学院深圳先进技术研究院 | Invasion detection method and device |
CN104933156A (en) * | 2015-06-25 | 2015-09-23 | 西安理工大学 | Collaborative filtering method based on shared neighbor clustering |
US20160004762A1 (en) * | 2014-07-07 | 2016-01-07 | Edward-Robert Tyercha | Hilbert Curve Partitioning for Parallelization of DBSCAN |
US9286391B1 (en) * | 2012-03-19 | 2016-03-15 | Amazon Technologies, Inc. | Clustering and recommending items based upon keyword analysis |
CN105550368A (en) * | 2016-01-22 | 2016-05-04 | 浙江大学 | Approximate nearest neighbor searching method and system of high dimensional data |
CN106503086A (en) * | 2016-10-11 | 2017-03-15 | 成都云麒麟软件有限公司 | The detection method of distributed local outlier |
-
2017
- 2017-06-26 CN CN201710495491.5A patent/CN107341210B/en active Active
Non-Patent Citations (1)
Title |
---|
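Cheng Kun (程堃): "Research on the Parallelization of Clustering Algorithms Based on a Cloud Platform", China Master's Theses Full-text Database (Information Science and Technology) * |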
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108491507A (en) * | 2018-03-22 | 2018-09-04 | 北京交通大学 | A kind of parallel continuous Query method of uncertain traffic flow data based on Hadoop distributed environments |
CN109656696A (en) * | 2018-12-03 | 2019-04-19 | 华南师范大学 | A kind of processing method that data API is efficiently called |
CN109656696B (en) * | 2018-12-03 | 2020-10-16 | 华南师范大学 | Processing method for efficient calling of data API |
CN110334725A (en) * | 2019-04-22 | 2019-10-15 | 国家电网有限公司 | Thunderstorm clustering method, device, computer equipment and the storage medium of lightning data |
CN110493221A (en) * | 2019-08-19 | 2019-11-22 | 四川大学 | A kind of network anomaly detection method based on the profile that clusters |
CN110493221B (en) * | 2019-08-19 | 2020-04-28 | 四川大学 | Network anomaly detection method based on clustering contour |
CN112579581A (en) * | 2020-11-30 | 2021-03-30 | 贵州力创科技发展有限公司 | Data access method and system of data analysis engine |
CN112579581B (en) * | 2020-11-30 | 2023-04-14 | 贵州力创科技发展有限公司 | Data access method and system of data analysis engine |
Also Published As
Publication number | Publication date |
---|---|
CN107341210B (en) | 2020-07-31 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109564568B (en) | Apparatus, method and machine-readable storage medium for distributed dataset indexing | |
CN107341210A (en) | C-DBSCAN-K clustering algorithm under Hadoop platform | |
He et al. | Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce | |
CN110134714B (en) | Distributed computing framework cache index method suitable for big data iterative computation | |
Salinas et al. | Data warehouse and big data integration | |
Hajeer et al. | Handling big data using a data-aware HDFS and evolutionary clustering technique | |
US20210263949A1 (en) | Computerized pipelines for transforming input data into data structures compatible with models | |
Yin et al. | Parallel implementing improved k-means applied for image retrieval and anomaly detection | |
US11809460B1 (en) | Systems, methods, and graphical user interfaces for taxonomy-based classification of unlabeled structured datasets | |
Graham et al. | Finding and visualizing graph clusters using pagerank optimization | |
Sundarakumar et al. | A heuristic approach to improve the data processing in big data using enhanced Salp Swarm algorithm (ESSA) and MK-means algorithm | |
Sergey et al. | Applying map-reduce paradigm for parallel closed cube computation | |
CN103823881B (en) | The method and device of the performance optimization of distributed data base | |
Qi et al. | Clustering remote RDF data using SPARQL update queries | |
Wan et al. | Dgs: Communication-efficient graph sampling for distributed gnn training | |
Arunachalam et al. | A survey on web service clustering | |
Tang et al. | Design of a data processing method for the farmland environmental monitoring based on improved Spark components | |
Papanikolaou | Distributed algorithms for skyline computation using apache spark | |
CN113505600B (en) | Distributed indexing method of industrial chain based on semantic concept space | |
CN111274243B (en) | Information processing method and system based on multidimensional model form | |
Aljanabi et al. | Large Dataset Classification Using Parallel Processing Concept | |
US11809915B1 (en) | Parallel processing techniques for expediting reconciliation for a hierarchy of forecasts on a computer system | |
Yushui et al. | K-means clustering algorithm for large-scale Chinese commodity information web based on Hadoop | |
Asnani | A distributed k-mean clustering algorithm for cloud data mining | |
Tan et al. | A Novel Association Rules Mining Based on Improved Fusion Particle Swarm Optimization Algorithm |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20200703. Address after: 510000 Room 302, Building 2, No. 10-16, Taihegang Road, Yuexiu District, Guangzhou City, Guangdong Province. Applicant after: SUNMNET TECHNOLOGY Co.,Ltd. Address before: 710048 No. 5 Jinhua Road, Xi'an, Shaanxi Province. Applicant before: XI'AN University OF TECHNOLOGY |
GR01 | Patent grant | |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: C-DBSCAN-K clustering algorithm based on Hadoop platform. Effective date of registration: 20210205. Granted publication date: 20200731. Pledgee: China Construction Bank Corp Guangzhou Liwan branch. Pledgor: SUNMNET TECHNOLOGY Co.,Ltd. Registration number: Y2021980001059 |
PP01 | Preservation of patent right | Effective date of registration: 20230919. Granted publication date: 20200731 |