CN103544328A - Parallel k-means clustering method based on Hadoop - Google Patents
- Publication number: CN103544328A
- Application number: CN201310568611.1A
- Authority: CN (China)
- Prior art keywords: data, hadoop, method based, parallel, clustering method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a parallel k-means clustering method based on Hadoop. The method comprises the following steps: the data are preprocessed and the initial centers of the k classes are selected; at the Mapper side of each node of the Hadoop platform, the distance from each data object to every center is computed; at the Mapper side, the center at the shortest distance is selected for each data object, and the local data are sent to the Combiner side of each node of the Hadoop platform; at the Combiner side, the data objects belonging to the same center are grouped and summed; the local data of each cluster are sent to the Reducer side of each node of the Hadoop platform; at the Reducer side, the local data of all clusters are gathered and the new center of every cluster is computed; these steps are repeated until a converged clustering result is obtained. The method can effectively solve the clustering problem for massive data and greatly increase clustering speed.
Description
Technical field
The invention relates to a parallel k-means clustering method based on Hadoop.
Background
The k-means algorithm is the most widely used partitioning method in clustering. Given a specified number of clusters k, it partitions the data set to generate the clusters, so it is a clustering algorithm in which the number of clusters is known in advance. The algorithm iterates between computing distances for the data objects and computing cluster means; once no data object changes its cluster, the iteration ends and the clustering result is obtained. The algorithm therefore has to keep reassigning data objects throughout the iteration and keep recomputing the new cluster centers after each adjustment. When the data volume reaches massive scale, this tests the efficiency of the algorithm itself: its running time becomes too long to meet practical requirements, so its running speed needs to be improved. Massive data also tests the processing speed and memory of the computer. Each iteration involves two main computations: first, the distance between every data object and every cluster center; second, the new cluster centers, which requires averaging all data objects in each cluster. As the data scale keeps growing, the time spent computing distances between data objects and cluster centers grows with it.
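The two computations described above can be sketched as a serial baseline. This is a minimal illustration, not code from the patent: the function names, the Euclidean distance, the `max_iter` cap, and the choice to keep the old center of an empty cluster are all assumptions.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance, assumed)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def kmeans(points, centers, max_iter=100):
    """Serial k-means: alternate assignment and mean update until no center changes."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Part 1: compare every data object with every cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Part 2: the new center of a cluster is the mean of its data objects.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[j])  # assumption: an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return centers

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
final_centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Part 1 costs one distance evaluation per (data object, center) pair in every iteration, which is the term that grows with the data scale.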
Summary of the invention
Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the invention is to provide a parallel k-means clustering method based on Hadoop that can effectively solve the clustering problem for massive data and greatly increase clustering speed.
Technical solution: to achieve the above object, the invention adopts a parallel k-means clustering method based on Hadoop, comprising the following steps:
(1) preprocess the data;
(2) select the initial centers of the k classes;
(3) at the Mapper side of each node of the Hadoop platform, compute the distance from each data object to every initial center;
(4) at the Mapper side, select the center at minimum distance, and send the local data to the Combiner side of each node of the Hadoop platform;
(5) at the Combiner side, group the data objects that belong to the same center, compute the sum of the data objects belonging to the same center, and send the local data of each cluster to the Reducer side of each node of the Hadoop platform;
(6) at the Reducer side, gather the local data of all clusters and compute the new center of every cluster;
(7) repeat steps (3) to (6) until all k cluster centers remain unchanged, at which point the iteration ends and the clustering result is obtained; otherwise continue iterating.
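Steps (3) to (7) can be simulated in memory as follows. This is a hedged sketch rather than the patent's implementation: it does not use the Hadoop API, each list in `splits` stands in for the data held by one node, and the names `mapper`, `combiner`, `reducer`, and `parallel_kmeans` are hypothetical. Emitting a count next to each partial sum is an assumption (the text mentions only the sum), as is keeping the old center of an empty cluster.

```python
import math
from collections import defaultdict

def mapper(split, centers):
    """Steps (3)-(4): emit (index of nearest center, point) for each point of one split."""
    for p in split:
        j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield j, p

def combiner(pairs):
    """Step (5): on one node, add up the points assigned to each center.

    The count emitted with each sum is an assumption; the reducer needs it for the mean.
    """
    partial = {}
    for j, p in pairs:
        s, n = partial.get(j, ((0.0,) * len(p), 0))
        partial[j] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return partial.items()

def reducer(all_partials, centers):
    """Step (6): merge the partial sums of all nodes and compute the new centers."""
    totals = defaultdict(lambda: (None, 0))
    for j, (s, n) in all_partials:
        ts, tn = totals[j]
        totals[j] = (s if ts is None else tuple(a + b for a, b in zip(ts, s)), tn + n)
    return [
        tuple(x / totals[j][1] for x in totals[j][0]) if j in totals else centers[j]
        for j in range(len(centers))
    ]

def parallel_kmeans(splits, centers, max_iter=100):
    """Step (7): repeat map, combine, reduce until no center changes."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        partials = [kv for split in splits for kv in combiner(mapper(split, centers))]
        new_centers = reducer(partials, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Two simulated nodes, each holding half of the data.
splits = [[(1.0, 1.0), (9.0, 9.0)], [(1.0, 2.0), (9.0, 10.0)]]
result = parallel_kmeans(splits, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Because the combiner collapses each node's points into at most k records, the data sent to the reducer scales with the number of clusters and nodes rather than with the number of data objects.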
Further, the local data take the form (key, value), where key is the center and value is the subset of data objects.
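To see why keying the local data by center lets the Reducer side recover the exact new center, consider the following sketch. The symbols are introduced here for illustration and do not appear in the source: $X_{j,m}$ is the subset of data objects on node $m$ assigned to center $j$, $s_{j,m}$ is the sum the Combiner side computes, and $n_{j,m}$ is a count that is assumed to travel with the sum.

```latex
s_{j,m} = \sum_{x \in X_{j,m}} x, \qquad n_{j,m} = \lvert X_{j,m} \rvert
\\[6pt]
c_j^{\text{new}}
  = \frac{\sum_{m} s_{j,m}}{\sum_{m} n_{j,m}}
  = \frac{1}{\lvert X_j \rvert} \sum_{x \in X_j} x,
\qquad X_j = \bigcup_{m} X_{j,m}
```

The second equality holds because the subsets $X_{j,m}$ on different nodes are disjoint, so merging partial sums gives the same mean as averaging the whole cluster in one place.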
Beneficial effects: the invention takes full advantage of the data-intensive nature of the problem, computing the k means in parallel and from them obtaining the k global cluster centers. Experimental results show that the method greatly improves clustering efficiency, works well for classifying large-scale data, achieves a good speedup, and lowers the method's demands on memory and processor capability.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a schematic comparison of the theoretical speedup and the actual speedup of the method;
Fig. 3 is a schematic diagram of the running time of the method on different numbers of nodes.
Detailed description
The invention is further illustrated below with reference to the drawings and specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading the invention, modifications of its various equivalent forms by those skilled in the art all fall within the scope defined by the claims of this application.
As shown in Fig. 1, the steps of the method are described in detail below:
As shown in Figs. 2 and 3, when the data volume is very large, the gap between the speedup achieved by the method and the ideal speedup is very small across different numbers of nodes, and the two are almost indistinguishable when the number of nodes does not exceed 3.
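The comparison between actual and ideal speedup can be made concrete with the usual definition, speedup = single-node running time / n-node running time, with n taken as the ideal value. That definition and the helper names are assumptions, since the source does not state how its speedup was computed. The running times below are hypothetical placeholders, not measurements from the patent.

```python
def speedup(t_single, t_parallel):
    """Actual speedup: single-node running time divided by n-node running time."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, nodes):
    """Fraction of the ideal (linear) speedup that is achieved; 1.0 means ideal."""
    return speedup(t_single, t_parallel) / nodes

# Hypothetical running times in seconds, keyed by number of nodes (not from the patent).
runtimes = {1: 600.0, 2: 300.0, 4: 160.0}
ratios = {n: speedup(runtimes[1], t) for n, t in runtimes.items()}
effs = {n: efficiency(runtimes[1], t, n) for n, t in runtimes.items()}
```

An efficiency close to 1.0 corresponds to the small gap between actual and ideal speedup that the text describes.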
Claims (2)
1. A parallel k-means clustering method based on Hadoop, comprising the following steps:
(1) preprocess the data;
(2) select the initial centers of the k classes;
(3) at the Mapper side of each node of the Hadoop platform, compute the distance from each data object to every initial center;
(4) at the Mapper side, select the center at minimum distance, and send the local data to the Combiner side of each node of the Hadoop platform;
(5) at the Combiner side, group the data objects that belong to the same center, compute the sum of the data objects belonging to the same center, and send the local data of each cluster to the Reducer side of each node of the Hadoop platform;
(6) at the Reducer side, gather the local data of all clusters and compute the new center of every cluster;
(7) repeat steps (3) to (6) until all k cluster centers remain unchanged, at which point the iteration ends and the clustering result is obtained; otherwise continue iterating.
2. The parallel k-means clustering method based on Hadoop according to claim 1, characterized in that the local data take the form (key, value), where key is the center and value is the subset of data objects.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310568611.1A CN103544328A (en) | 2013-11-15 | 2013-11-15 | Parallel k mean value clustering method based on Hadoop |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310568611.1A CN103544328A (en) | 2013-11-15 | 2013-11-15 | Parallel k mean value clustering method based on Hadoop |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103544328A true CN103544328A (en) | 2014-01-29 |
Family
ID=49967780
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310568611.1A Pending CN103544328A (en) | 2013-11-15 | 2013-11-15 | Parallel k mean value clustering method based on Hadoop |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103544328A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105138661A (en) * | 2015-09-02 | 2015-12-09 | 西北大学 | Hadoop-based k-means clustering analysis system and method of network security log |
CN105183855A (en) * | 2015-09-08 | 2015-12-23 | 浪潮(北京)电子信息产业有限公司 | Information classification method and system |
CN105760478A (en) * | 2016-02-15 | 2016-07-13 | 中山大学 | Large-scale distributed data clustering method based on machine learning |
WO2016123808A1 (en) * | 2015-02-06 | 2016-08-11 | 华为技术有限公司 | Data processing system, calculation node and data processing method |
CN107392239A (en) * | 2017-07-11 | 2017-11-24 | 南京邮电大学 | A kind of K Means algorithm optimization methods based on Spark computation models |
- 2013-11-15: CN application CN201310568611.1A, published as CN103544328A (active, Pending)
Non-Patent Citations (2)
Title |
---|
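Xiang Xiaojun et al.: "Parallelization of massive text classification on the Hadoop platform" (基于Hadoop平台的海量文本分类的并行化), Computer Science (《计算机科学》) * |
Zhang Minghui: "Analysis and research of data mining algorithms based on Hadoop" (基于Hadoop的数据挖掘算法的分析与研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |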
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016123808A1 (en) * | 2015-02-06 | 2016-08-11 | 华为技术有限公司 | Data processing system, calculation node and data processing method |
CN106062732A (en) * | 2015-02-06 | 2016-10-26 | 华为技术有限公司 | Data processing system, calculation node and data processing method |
CN106062732B (en) * | 2015-02-06 | 2019-03-01 | 华为技术有限公司 | Data processing system, calculate node and the method for data processing |
US10567494B2 (en) | 2015-02-06 | 2020-02-18 | Huawei Technologies Co., Ltd. | Data processing system, computing node, and data processing method |
CN105138661A (en) * | 2015-09-02 | 2015-12-09 | 西北大学 | Hadoop-based k-means clustering analysis system and method of network security log |
CN105138661B (en) * | 2015-09-02 | 2018-10-30 | 西北大学 | A kind of network security daily record k-means cluster analysis systems and method based on Hadoop |
CN105183855A (en) * | 2015-09-08 | 2015-12-23 | 浪潮(北京)电子信息产业有限公司 | Information classification method and system |
CN105760478A (en) * | 2016-02-15 | 2016-07-13 | 中山大学 | Large-scale distributed data clustering method based on machine learning |
CN107392239A (en) * | 2017-07-11 | 2017-11-24 | 南京邮电大学 | A kind of K Means algorithm optimization methods based on Spark computation models |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hu et al. | Scalability of control planes for software defined networks: Modeling and evaluation | |
CN103544328A (en) | Parallel k mean value clustering method based on Hadoop | |
Liang et al. | Solving the blocking flow shop scheduling problem by a dynamic multi-swarm particle swarm optimizer | |
Wang et al. | Accelerating federated learning with cluster construction and hierarchical aggregation | |
CN103095598B (en) | Monitor data polymerization under a kind of large-scale cluster environment | |
CN103942108A (en) | Resource parameter optimization method under Hadoop homogenous cluster | |
CN105550238A (en) | Architecture system of database appliance | |
CN101631395B (en) | Method for removing interference noise from moving object locating in wireless sensor network | |
Sepúlveda et al. | 3DMIA: A multi-objective artificial immune algorithm for 3D-MPSoC multi-application 3D-NoC mapping | |
Chen et al. | An improved particle swarm optimization algorithm based on centroid and exponential inertia weight | |
CN103984832A (en) | Simulation analysis method for electric field of aluminum electrolysis cell | |
Yang et al. | A self-adaptive sliding window technique for mining data streams | |
Al Rasyid et al. | LEACH Partition Topology for Wireless Sensor Network | |
CN111222680A (en) | Wind power station output ultra-short-term prediction method based on least square support vector machine | |
Thirumurugan et al. | Ex-PAC: An improved clustering technique for ad hoc network | |
Yuejuan et al. | Task scheduling algorithm based on reliability perception in cloud computing | |
CN105337759B (en) | It is a kind of based on inside and outside community structure than measure and community discovery method | |
Gu et al. | A random distribution harmony search algorithm | |
Jin et al. | Computation offloading and resource allocation for MEC in C-RAN: A deep reinforcement learning approach | |
CN109802440B (en) | Offshore wind farm equivalence method, system and device based on wake effect factor | |
CN104168572A (en) | Artificial physics optimization cognitive radio network spectrum distribution method | |
CN103942368A (en) | Structure design method for laser cutting machine tool | |
CN105631920A (en) | Sample simplifying method of radial basis function support points | |
Du et al. | FASTBEE: A fast and self-adaptive clustering algorithm towards to edge computing | |
CN102622446A (en) | Hadoop based parallel k nearest neighbor classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20140129 |