CN107341210A

CN107341210A - C-DBSCAN-K clustering algorithm on the Hadoop platform

Info

Publication number: CN107341210A
Application number: CN201710495491.5A
Authority: CN
Inventors: 王彬; 安涛; 吕征
Original assignee: Xian University of Technology
Current assignee: SUNMNET TECHNOLOGY Co.,Ltd.
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2017-11-10
Anticipated expiration: 2037-06-26
Also published as: CN107341210B

Abstract

A C-DBSCAN-K clustering algorithm on the Hadoop platform, comprising the following steps: Step 1, establish a computer cluster whose nodes can communicate with one another; Step 2, set up the Hadoop platform on the computer cluster; Step 3, upload the data set A to be clustered to HDFS using the dfs -put command; Step 4, run the Canopy clustering algorithm to perform an initial clustering of the data in A, obtaining a coarse-grained clustering result; Step 5, construct a k-d tree on each cluster obtained in step 4; Step 6, run the DBSCAN algorithm on the clusters obtained in step 4, using the k-d tree to query the ε-neighborhood of the data objects in each cluster, and output the clustering result; Step 7, merge the clusters from step 6 that contain identical data, and output the clustering result. The algorithm of the invention solves the problem in the prior art that the DBSCAN clustering algorithm has low clustering efficiency on large-scale data sets.

Description

C-DBSCAN-K clustering algorithm on the Hadoop platform

Technical field

The invention belongs to the technical field of computer data mining and relates to a C-DBSCAN-K clustering algorithm on the Hadoop platform.

Background

Internet technology is developing rapidly and has become deeply embedded in daily life; modern society has entered the information age, and massive amounts of data are everywhere. When faced with massive data, the first task is to classify it sensibly, and cluster analysis is such a process. With clustering, valuable classification knowledge can be identified automatically from a data set containing a large number of objects, the distribution of the data can be obtained, and the differences between clusters can be observed; on that basis, specific clusters can be analyzed in more depth. Cluster analysis is widely used in business intelligence, image steganalysis, Web search, and other fields.

However, with the rapid development of the Internet era and the widespread use of mobile devices, the volume of data is growing exponentially, and traditional clustering algorithms that run on a single machine can no longer meet efficiency requirements. The Hadoop distributed platform is a powerful tool for processing big data and provides the conditions for data mining. How to carry out data mining with Hadoop, and how to redesign traditional single-machine algorithms in a distributed manner with the MapReduce model so that the Hadoop distributed platform can process massive data efficiently, is therefore of great significance.

Summary of the invention

The object of the invention is to provide a C-DBSCAN-K clustering algorithm on the Hadoop platform that solves the problem in the prior art that the DBSCAN clustering algorithm has low clustering efficiency on large-scale data sets.

The technical solution adopted by the invention is a C-DBSCAN-K clustering algorithm on the Hadoop platform, comprising the following steps:

Step 1, connect multiple computers to the same local area network, with each computer serving as a node, to establish a computer cluster whose nodes can communicate with one another;

Step 2, set up the Hadoop platform on the computer cluster;

Step 3, upload the data set A to be clustered to the Hadoop distributed file system using the Hadoop distributed file command dfs -put;

Step 4, run the Canopy clustering algorithm to perform an initial clustering of the data in data set A, obtaining a coarse-grained clustering result;

Step 5, construct a k-d tree on each cluster obtained in step 4;

Step 6, run the DBSCAN algorithm on the clusters obtained in step 4; during neighborhood queries, use the k-d tree constructed in step 5 to query the ε-neighborhood of the data objects in each cluster, and output the clustering result of the DBSCAN algorithm;

Step 7, merge the clusters from step 6 that contain identical data, and output the clustering result.
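The upload in step 3 can be scripted. The following is a minimal sketch, assuming the `hdfs dfs -put` form of the command is available on the node; the text names only the dfs -put command, and the file name and HDFS directory below are hypothetical.

```python
import subprocess


def build_put_command(local_path, hdfs_dir):
    """Return the argument list that uploads local_path into hdfs_dir."""
    # Assumed command form: hdfs dfs -put <local file> <HDFS directory>
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def upload(local_path, hdfs_dir):
    """Run the upload; raises CalledProcessError if the command fails."""
    subprocess.run(build_put_command(local_path, hdfs_dir), check=True)


if __name__ == "__main__":
    # Hypothetical paths, shown for illustration only.
    print(" ".join(build_put_command("dataset_A.csv", "/input")))
```

Keeping the command construction separate from its execution makes the script easy to check on a machine without Hadoop installed.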

Step 2 specifically comprises:

First, install the redhat6.2 operating system on each node in the computer cluster; then install the Hadoop2.2.0 files and the jdk1.8.0_65 files on each node; configure the .bashrc file of the redhat6.2 system on each node so that the system is associated with the Hadoop2.2.0 files and the jdk1.8.0_65 files on that node; and, on each node, configure the following files in Hadoop2.2.0: hadoop-env.sh, mapred-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
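The *-site.xml files listed above share one layout: a `configuration` element holding `property` entries, each with a `name` and a `value`. The sketch below generates such a file; the text does not give any property values, so the `fs.defaultFS` entry and the address `hdfs://master:9000` are assumptions chosen only to illustrate the format.

```python
import xml.etree.ElementTree as ET


def make_site_xml(properties):
    """Build a Hadoop *-site.xml document from a {name: value} dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical NameNode address; replace with the cluster's own master node.
core_site = make_site_xml({"fs.defaultFS": "hdfs://master:9000"})
print(core_site)
```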

Step 4 specifically comprises the following steps:

Step 4.1, determine the center point set;

Step 4.2, cluster the data in data set A to be clustered according to the center point set.

Step 4.1 specifically comprises the following steps:

Step 4.1.1, start the first Map task, then scan and read in the data in data set A to be clustered;

Step 4.1.2, initialize a center point set KEY1 as empty; for each data item read in, if KEY1 is empty, add the item to KEY1; if KEY1 is not empty, use formula (1) to calculate the distance dist1 from the item to each center point in KEY1:

$$ \mathrm{dist1} = \mathrm{dist}(d_i, d_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}| \tag{1} $$

where d_i is the i-th data item in data set A to be clustered, d_i = (x_{i1}, x_{i2}, …, x_{ip}), and x_{i1}, x_{i2}, …, x_{ip} are the p numerical attributes of d_i; d_j is the j-th center point in KEY1, d_j = (x_{j1}, x_{j2}, …, x_{jp}), and x_{j1}, x_{j2}, …, x_{jp} are the p numerical attributes of d_j; dist(d_i, d_j) denotes the Manhattan distance from d_i to d_j;

Step 4.1.3, if, for d_i, there exists a d_j such that dist(d_i, d_j) < T1, add d_i to KEY1, then update and output KEY1; where T1 is a preset initial distance threshold;

Step 4.1.4, start the first Reduce task and read in the data in KEY1 output by the first Map task; initialize a center point set KEY2 as empty; for each data item read in, if KEY2 is empty, add the item to KEY2; if KEY2 is not empty, use formula (1) to calculate the distance dist2 from the item to each center point in KEY2; if there exists a center point such that dist2 < T1, add the item to KEY2, then update and output KEY2.
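Formula (1), which steps 4.1.2 to 4.1.4 apply between a data item and the center points, can be sketched as follows. The helper name `centers_within` and the sample values are illustrative assumptions; only the Manhattan distance itself comes from the text.

```python
def manhattan(d_i, d_j):
    """Formula (1): sum of the absolute differences of the p attributes."""
    if len(d_i) != len(d_j):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(d_i, d_j))


def centers_within(point, centers, threshold):
    """Return the centers whose distance to point is below threshold."""
    return [c for c in centers if manhattan(point, c) < threshold]


print(manhattan((1.0, 2.0, 3.0), (4.0, 0.0, 3.0)))  # 3 + 2 + 0 = 5.0
```

A non-empty result from `centers_within` corresponds to the condition "there exists a center point whose distance is below the threshold" used in the steps above.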

Step 4.2 specifically comprises the following steps:

Step 4.2.1, start the second Map task, and read in the data in data set A to be clustered and the data in KEY2 output by the first Reduce task;

Step 4.2.2, use formula (1) to calculate the distance dist3 from each data item in data set A to each center point in KEY2;

Step 4.2.3, if there exists a center point in KEY2 such that dist3 < T2, form a set B from that center point and the data items whose distance to it is less than T2, and output set B; where T2 is a preset initial distance threshold;

Step 4.2.4, start the second Reduce task and read in the sets B output by the second Map task; add the data in the sets B that share the same center point, other than the center point itself, to the same cluster, and output the cluster (key, list); where key denotes a center point and list denotes all data in the same cluster other than key.
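Steps 4.2.1 to 4.2.4 can be sketched on a single machine as a map function that emits (center, data) pairs and a reduce function that groups them into (key, list) clusters. This is a plain-Python illustration rather than Hadoop MapReduce code; the function names, the sample points, and the value of T2 are assumptions.

```python
from collections import defaultdict


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def map_assign(points, centers, t2):
    """Map phase: emit (center, point) for every center closer than t2."""
    for p in points:
        for c in centers:
            if manhattan(p, c) < t2:
                yield c, p


def reduce_group(pairs):
    """Reduce phase: group points by center, leaving the center out of its list."""
    clusters = defaultdict(list)
    for center, point in pairs:
        clusters[center]  # ensure a center with no other members still appears
        if point != center:
            clusters[center].append(point)
    return dict(clusters)


centers = [(0, 0), (10, 10)]  # stands in for KEY2 (illustrative values)
points = [(0, 0), (1, 1), (9, 10), (10, 10), (5, 5)]
clusters = reduce_group(map_assign(points, centers, t2=11))
for key, members in clusters.items():
    print(key, members)
```

In this example the point (5, 5) lies within T2 of both centers and therefore appears in both clusters; overlaps of this kind are what the merge in step 7 resolves.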

Step 5 specifically comprises: start the third Map task and read in the clusters output by the second Reduce task, one cluster at a time; construct a k-d tree on the data of that cluster.
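Building a k-d tree over one cluster and using it for an ε-neighborhood query can be sketched as below. The text does not specify the tree layout or the distance used in the query, so the nested-tuple representation, the median split, and the reuse of the Manhattan distance from formula (1) are assumptions.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Build a k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (
        points[mid],
        axis,
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def range_query(node, query, eps, found=None):
    """Collect every stored point whose distance to query is at most eps."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis, left, right = node
    if manhattan(point, query) <= eps:
        found.append(point)
    diff = query[axis] - point[axis]
    # A subtree is skipped when the splitting plane is farther away than eps,
    # because a difference in one coordinate never exceeds the full distance.
    if diff <= eps:
        range_query(left, query, eps, found)
    if diff >= -eps:
        range_query(right, query, eps, found)
    return found


tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(sorted(range_query(tree, (6, 3), eps=2)))
```

The pruning test is what saves work compared with scanning every point in the cluster: whole subtrees on the far side of a splitting plane are never visited.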

Step 7 specifically comprises the following steps:

Step 7.1, start the fourth Map task and read in the clusters output by step 6;

Step 7.2, initialize a set C; read in one cluster at a time and add it to set C; for each cluster read in, determine whether it contains data identical to that of any cluster in set C; if so, take the clusters that share identical data with it out of set C, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set C;

Step 7.3, after all clusters read into the fourth Map task have been processed by step 7.2, output set C;

Step 7.4, start the fourth Reduce task and read in the clusters in set C output by step 7.3; initialize a set D; read in one cluster at a time and add it to set D; for each cluster read in, determine whether it contains data identical to that of any cluster in set D; if so, take the clusters that share identical data with it out of set D, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set D;

Step 7.5, after all clusters read into the fourth Reduce task have been processed by step 7.4, output set D as the clustering result.
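The merge rule of steps 7.2 and 7.4 can be sketched on a single machine as follows. The function name and the sample clusters are illustrative assumptions; Python sets are used so that duplicate data disappears when clusters are merged.

```python
def merge_overlapping(clusters):
    """Merge clusters that share at least one data item."""
    merged = []  # plays the role of set C (or set D in the Reduce task)
    for cluster in clusters:
        current = set(cluster)  # a set drops duplicate data automatically
        keep = []
        for existing in merged:
            if existing & current:
                current |= existing  # take it out of the set and merge it
            else:
                keep.append(existing)
        keep.append(current)
        merged = keep
    return merged


print(merge_overlapping([[1, 2, 3], [4, 5], [3, 4], [6]]))
```

Here the third cluster shares an item with each of the first two, so all three collapse into one cluster and the result holds two clusters: {1, 2, 3, 4, 5} and {6}.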

The beneficial effects of the invention are as follows. The C-DBSCAN-K clustering algorithm on the Hadoop platform first uses the Canopy clustering algorithm to quickly obtain a coarse-grained clustering result; it then constructs a k-d tree data structure on the coarse-grained clustering result and runs the DBSCAN algorithm, using the k-d tree to query the ε-neighborhood of each object, which speeds up DBSCAN; finally, it merges the clusters that share objects to obtain the final clustering result. The algorithm is fast and effective on large data sets and significantly improves the execution efficiency of the DBSCAN algorithm without reducing clustering accuracy.
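The role of the ε-neighborhood query inside DBSCAN can be sketched as below. This is a generic single-machine DBSCAN, not the patented MapReduce implementation: the parameter `min_pts` (the usual DBSCAN density threshold, not mentioned in the text), the sample points, and the brute-force query are assumptions. The `region_query` argument is the place where a k-d tree lookup would replace the linear scan.

```python
NOISE = -1
UNVISITED = None


def dbscan(n_points, min_pts, region_query):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE.

    region_query(i) must return the indices of all points in the
    eps-neighborhood of point i, including i itself.
    """
    labels = [UNVISITED] * n_points
    cluster_id = 0
    for i in range(n_points):
        if labels[i] is not UNVISITED:
            continue
        seeds = region_query(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


def brute_force_query(points, eps):
    """Linear-scan neighborhood query (Manhattan distance), for illustration."""
    def query(i):
        return [j for j, q in enumerate(points)
                if sum(abs(a - b) for a, b in zip(points[i], q)) <= eps]
    return query


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(len(points), 3, brute_force_query(points, eps=1)))
```

Every point triggers at least one neighborhood query, so the cost of that query dominates the running time; replacing the linear scan with a tree-based range query is the speed-up the paragraph above describes.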

Brief description of the drawings

Fig. 1 is a flow chart of the C-DBSCAN-K clustering algorithm on the Hadoop platform;

Fig. 2 compares the clustering results of the C-DBSCAN-K clustering algorithm on the Hadoop platform with those of the DBSCAN algorithm;

Fig. 3 compares the execution time of the C-DBSCAN-K clustering algorithm on the Hadoop platform with that of the DBSCAN algorithm.

Embodiment

The invention is described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the C-DBSCAN-K clustering algorithm on the Hadoop platform comprises the following steps:

Step 2, set up the Hadoop platform on the computer cluster;

Step 2 specifically comprises: first, install the redhat6.2 operating system on each node in the computer cluster; then install the Hadoop2.2.0 files and the jdk1.8.0_65 files on each node; configure the .bashrc file of the redhat6.2 system on each node so that the system is associated with the Hadoop2.2.0 files and the jdk1.8.0_65 files on that node; and, on each node, configure the following files in Hadoop2.2.0: hadoop-env.sh, mapred-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

Step 4 specifically comprises the following steps:

Step 4.1, determine the center point set;

Step 4.1 specifically comprises the following steps:

$$ \mathrm{dist1} = \mathrm{dist}(d_i, d_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}| \tag{1} $$

Step 4.2, cluster the data in data set A to be clustered according to the center point set KEY2.

Step 4.2 specifically comprises the following steps:

Step 5, construct a k-d tree on each cluster obtained in step 4;

Step 7 specifically comprises the following steps:

Step 7.1, start the fourth Map task and read in the clusters output by step 6;

Step 7.2, initialize a set C; read in one cluster at a time and add it to set C; for each cluster read in, determine whether it contains data identical to that of any cluster in set C; if so, take the clusters that share identical data with it out of set C, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set C;

Step 7.4, start the fourth Reduce task and read in the clusters in set C output by step 7.3; initialize a set D; read in one cluster at a time and add it to set D; for each cluster read in, determine whether it contains data identical to that of any cluster in set D; if so, take the clusters that share identical data with it out of set D, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set D;

Four simulated data sets generated with R language packages (the face, spirals, cassini, and hypercube data sets) are taken as the data sets to be clustered, and the C-DBSCAN-K clustering algorithm on the Hadoop platform and the DBSCAN algorithm are each used to cluster the data in the four data sets. Fig. 2(a) and Fig. 2(b) show the results of the C-DBSCAN-K clustering algorithm on the Hadoop platform and of the DBSCAN algorithm on the face data set; Fig. 2(c) and Fig. 2(d) show their results on the spirals data set; Fig. 2(e) and Fig. 2(f) show their results on the cassini data set; Fig. 2(g) and Fig. 2(h) show their results on the hypercube data set. The rectangular boxes in Fig. 2 represent noise points, and the shapes in other gray levels represent different clusters. Fig. 2 shows that, on each data set, the clusters generated by the C-DBSCAN-K clustering algorithm on the Hadoop platform are identical to those generated by the DBSCAN algorithm, and the noise points identified are also identical; that is, the two algorithms have the same accuracy.

As shown in Fig. 3, when clustering the face data set containing 120k data items, the spirals data set containing 150k data items, the cassini data set containing 180k data items, and the hypercube data set containing 200k data items, the C-DBSCAN-K clustering algorithm on the Hadoop platform has a shorter running time than the DBSCAN algorithm and therefore a higher execution efficiency.

In summary, the C-DBSCAN-K clustering algorithm on the Hadoop platform first uses the Canopy clustering algorithm to quickly obtain a coarse-grained clustering result; it then constructs a k-d tree data structure on the coarse-grained clustering result and runs the DBSCAN algorithm, using the k-d tree to query the ε-neighborhood of each object, which speeds up DBSCAN; finally, it merges the clusters that share objects to obtain the final clustering result. The algorithm is fast and effective on large data sets and significantly improves the execution efficiency of the DBSCAN algorithm without reducing clustering accuracy.

Claims

1. A C-DBSCAN-K clustering algorithm on the Hadoop platform, characterized by comprising the following steps:

Step 1, connect multiple computers to the same local area network, with each computer serving as a node, to establish a computer cluster whose nodes can communicate with one another;

Step 2, set up the Hadoop platform on the computer cluster;

Step 3, upload the data set A to be clustered to the Hadoop distributed file system using the Hadoop distributed file command dfs -put;

Step 4, run the Canopy clustering algorithm to perform an initial clustering of the data in the data set A to be clustered, obtaining a coarse-grained clustering result;

Step 5, construct a k-d tree on each cluster obtained in step 4;

Step 6, run the DBSCAN algorithm on the clusters obtained in step 4; during neighborhood queries, use the k-d tree constructed in step 5 to query the ε-neighborhood of the data objects in each cluster, and output the clustering result of the DBSCAN algorithm;

Step 7, merge the clusters from step 6 that contain identical data, and output the clustering result.
2. The C-DBSCAN-K clustering algorithm on the Hadoop platform according to claim 1, characterized in that step 2 specifically comprises:

First, install the redhat6.2 operating system on each node in the computer cluster; then install the Hadoop2.2.0 files and the jdk1.8.0_65 files on each node; configure the .bashrc file of the redhat6.2 system on each node so that the system is associated with the Hadoop2.2.0 files and the jdk1.8.0_65 files on that node; and, on each node, configure the following files in Hadoop2.2.0: hadoop-env.sh, mapred-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
3. The C-DBSCAN-K clustering algorithm on the Hadoop platform according to claim 1, characterized in that step 4 specifically comprises the following steps:

Step 4.1, determine the center point set;

Step 4.2, cluster the data in the data set A to be clustered according to the center point set.
4. The C-DBSCAN-K clustering algorithm on the Hadoop platform according to claim 3, characterized in that step 4.1 specifically comprises the following steps:

Step 4.1.1, start the first Map task, then scan and read in the data in the data set A to be clustered;

Step 4.1.2, initialize a center point set KEY1 as empty; for each data item read in, if KEY1 is empty, add the item to KEY1; if KEY1 is not empty, use formula (1) to calculate the distance dist1 from the item to each center point in KEY1:

$$ \mathrm{dist1} = \mathrm{dist}(d_i, d_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}| \tag{1} $$

where d_i is the i-th data item in the data set A to be clustered, d_i = (x_{i1}, x_{i2}, …, x_{ip}), and x_{i1}, x_{i2}, …, x_{ip} are the p numerical attributes of d_i; d_j is the j-th center point in KEY1, d_j = (x_{j1}, x_{j2}, …, x_{jp}), and x_{j1}, x_{j2}, …, x_{jp} are the p numerical attributes of d_j; dist(d_i, d_j) denotes the Manhattan distance from d_i to d_j;

Step 4.1.3, if, for d_i, there exists a d_j such that dist(d_i, d_j) < T1, add d_i to KEY1, then update and output KEY1; where T1 is a preset initial distance threshold;

Step 4.1.4, start the first Reduce task and read in the data in KEY1 output by the first Map task; initialize a center point set KEY2 as empty; for each data item read in, if KEY2 is empty, add the item to KEY2; if KEY2 is not empty, use formula (1) to calculate the distance dist2 from the item to each center point in KEY2; if there exists a center point such that dist2 < T1, add the item to KEY2, then update and output KEY2.
5. The C-DBSCAN-K clustering algorithm on the Hadoop platform according to claim 4, characterized in that step 4.2 specifically comprises the following steps:

Step 4.2.1, start the second Map task, and read in the data in the data set A to be clustered and the data in KEY2 output by the first Reduce task;

Step 4.2.2, use formula (1) to calculate the distance dist3 from each data item in the data set A to be clustered to each center point in KEY2;

Step 4.2.3, if there exists a center point in KEY2 such that dist3 < T2, form a set B from that center point and the data items whose distance to it is less than T2, and output set B; where T2 is a preset initial distance threshold;

Step 4.2.4, start the second Reduce task and read in the sets B output by the second Map task; add the data in the sets B that share the same center point, other than the center point itself, to the same cluster, and output the cluster (key, list); where key denotes a center point and list denotes all data in the same cluster other than key.
6. The C-DBSCAN-K clustering algorithm on the Hadoop platform according to claim 5, characterized in that step 5 specifically comprises:

Start the third Map task and read in the clusters output by the second Reduce task, one cluster at a time; construct a k-d tree on the data of that cluster.
7. The C-DBSCAN-K clustering algorithm on the Hadoop platform according to claim 6, characterized in that step 7 specifically comprises the following steps:

Step 7.1, start the fourth Map task and read in the clusters output by step 6;

Step 7.2, initialize a set C; read in one cluster at a time and add it to set C; for each cluster read in, determine whether it contains data identical to that of any cluster in set C; if so, take the clusters that share identical data with it out of set C, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set C;

Step 7.3, after all clusters read into the fourth Map task have been processed by step 7.2, output set C;

Step 7.4, start the fourth Reduce task and read in the clusters in set C output by step 7.3; initialize a set D; read in one cluster at a time and add it to set D; for each cluster read in, determine whether it contains data identical to that of any cluster in set D; if so, take the clusters that share identical data with it out of set D, merge them with the cluster read in, remove the duplicate data from the merged cluster, and add the merged cluster to set D;

Step 7.5, after all clusters read into the fourth Reduce task have been processed by step 7.4, output set D as the clustering result.