CN103544328A - Parallel k mean value clustering method based on Hadoop

Info

Publication number
CN103544328A
Authority
CN
China
Prior art keywords
data
hadoop
method based
parallel
clustering method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310568611.1A
Other languages
Chinese (zh)
Inventor
高阳
王睿
史颖欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Nanjing University
Original Assignee
JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-11-15
Filing date
2013-11-15
Publication date
2014-01-29
Application filed by JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd and Nanjing University
Priority to CN201310568611.1A
Publication of CN103544328A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses a parallel k-means clustering method based on Hadoop. The method comprises the following steps: the data are preprocessed and the initial centers of the k classes are selected; at the Mapper side of each node of the Hadoop platform, the distance from each data object to every center is calculated; at the Mapper side, the center at the shortest distance is selected for each data object and the local data are sent to the Combiner side of each node of the Hadoop platform; at the Combiner side, the data objects assigned to the same center are grouped together and summed; the local data of each cluster are sent to the Reducer side of each node of the Hadoop platform; at the Reducer side, the local data of all clusters are aggregated and the new center of each cluster is calculated; these steps are repeated until the clustering result converges. The method can effectively solve the clustering problem for massive data and greatly improve clustering speed.

Description

A parallel k-means clustering method based on Hadoop
Technical field
The present invention relates to a parallel k-means clustering method based on Hadoop.
Background
The k-means algorithm is the most widely used partitioning method among clustering methods. Given a specified number of clusters k, it partitions the data set into clusters; it is a clustering algorithm for which the number of clusters is known in advance. The algorithm iterates over two calculations: the distances between the data objects and the cluster centers, and the cluster means. When the data objects no longer change cluster, the iteration ends and the clustering result is obtained. The algorithm therefore has to keep reassigning data objects during the iteration and keep recalculating the new cluster centers after each adjustment. When the data volume reaches massive scale, the efficiency of the algorithm itself is put to the test: its running time can hardly meet practical requirements, so ways of speeding it up must be considered. Massive data also tests the processing speed and memory of the computer. The computation needed in each iteration has two main parts: first, calculating the distance between each data object and all cluster centers; second, generating the new cluster centers, which requires averaging all data objects in each cluster. As the data scale keeps growing, the time spent calculating distances between data objects and cluster centers grows with it.
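The two per-iteration costs described above can be seen in a minimal serial sketch. This is an illustration only, not code from the patent: the function names (`euclidean`, `kmeans_serial`), the use of Euclidean distance, the `max_iter` cap, and the choice to leave an empty cluster's center unchanged are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_serial(points, centers, max_iter=100):
    """Serial k-means on a list of tuples; returns the final centers."""
    for _ in range(max_iter):
        # Part 1: distance from every data object to every cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)

        # Part 2: new center of each cluster = mean of its data objects.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(old)

        # Stop when no center moved, i.e. the assignment is stable.
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Part 1 costs one distance evaluation per (data object, center) pair in every iteration, which is the part that grows fastest with the data volume.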
Summary of the invention
Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the invention is to provide a parallel k-means clustering method based on Hadoop that can effectively solve the clustering problem for massive data and greatly improve clustering speed.
Technical solution: to achieve the above object, the invention adopts a parallel k-means clustering method based on Hadoop, comprising the following steps:
(1) preprocess the data;
(2) select the initial centers of the k classes;
(3) at the Mapper side of each node of the Hadoop platform, calculate the distance from each data object to all the initial centers;
(4) at the Mapper side, select the center at the minimum distance, and send the local data to the Combiner side of each node of the Hadoop platform;
(5) at the Combiner side, group together the data objects belonging to the same center, calculate the sum of the data objects belonging to the same center, and send the local data of each cluster to the Reducer side of each node of the Hadoop platform;
(6) at the Reducer side, aggregate the local data of all clusters and calculate the new center of every cluster;
(7) repeat steps (3) to (6); if all k cluster centers remain unchanged, the iteration ends and the clustering result is obtained; otherwise, continue iterating.
Further, the local data is a (key, value) pair, where key is the center and value is the subset of data objects.
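Steps (3) to (7) can be sketched as three plain functions run in a single process. This is a hedged simulation, not Hadoop code and not the patent's implementation: the names `mapper`, `combiner`, `reducer` and `parallel_kmeans` are hypothetical, each input split stands in for one node, squared Euclidean distance is assumed, and the combiner is assumed to carry a count alongside each sum so that the reducer can form a mean (the text mentions only the sum).

```python
def sq_dist(a, b):
    """Squared Euclidean distance; enough for picking the nearest center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mapper(points, centers):
    """Steps (3)-(4): emit (index of nearest center, data object)."""
    for p in points:
        key = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        yield key, p


def combiner(pairs):
    """Step (5): per node, sum the data objects that share a center.

    Emits (center index, (vector sum, count)); the count is an assumption.
    """
    acc = {}
    for key, p in pairs:
        if key in acc:
            total, n = acc[key]
            acc[key] = ([a + b for a, b in zip(total, p)], n + 1)
        else:
            acc[key] = (list(p), 1)
    return list(acc.items())


def reducer(partials, centers):
    """Step (6): merge the per-node partial sums and compute new centers."""
    merged = {}
    for key, (total, n) in partials:
        if key in merged:
            old_total, old_n = merged[key]
            merged[key] = ([a + b for a, b in zip(old_total, total)],
                           old_n + n)
        else:
            merged[key] = (list(total), n)
    new_centers = list(centers)  # a center with no members stays unchanged
    for key, (total, n) in merged.items():
        new_centers[key] = tuple(x / n for x in total)
    return new_centers


def parallel_kmeans(splits, centers, max_iter=100):
    """Step (7): repeat map/combine/reduce until no center changes."""
    for _ in range(max_iter):
        partials = []
        for split in splits:  # each split plays the role of one node
            partials.extend(combiner(mapper(split, centers)))
        new_centers = reducer(partials, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

The point of the combiner is that each node forwards at most k partial sums per iteration rather than every data object, which reduces the data sent to the Reducer side.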
Beneficial effects: the invention makes full use of the data-intensive nature of the problem, computes the k-means data in parallel, and thereby obtains the k global cluster centers. Experimental results show that the method greatly improves clustering efficiency, works well on classification problems over large-scale data, achieves a good speedup, and reduces the method's demands on memory and core processing power.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a schematic comparison of the theoretical speedup and the actual speedup of the method of the invention;
Fig. 3 is a schematic diagram of the running time of the method of the invention on different numbers of nodes.
Detailed description of the embodiments
The invention is further illustrated below with reference to the drawings and specific embodiments. It should be understood that these embodiments serve only to illustrate the invention and not to limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims of this application.
As shown in Fig. 1, the steps of the method of the invention are described in detail below:
Step 1: preprocess the data, for example by calculating the TF-IDF (term frequency-inverse document frequency) values of text data.
Step 2: select suitable initial cluster centers for the k classes, for example at random.
Step 3: at the Mapper side of each node of the Hadoop platform, calculate the distance from each data object to all the initial cluster centers (for example, the cosine distance for text data and the Euclidean distance for vector data).
Step 4: at the Mapper side, select the center at the minimum distance; with the center as key and the subset of data objects as value, form a data pair, called the local data, and send the local data to the Combiner side of each node of the Hadoop platform.
Step 5: at the Combiner side, group together the data objects belonging to the same center, calculate the sum of the data objects belonging to the same center, and send the local data of each cluster to the Reducer side of each node of the Hadoop platform.
Step 6: at the Reducer side, aggregate the local data of all clusters and calculate the new center of every cluster.
Step 7: repeat steps 3 to 6; if all k cluster centers remain unchanged, the iteration ends and the clustering result is obtained; otherwise, continue iterating.
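The preprocessing and distance choices named in steps 1 and 3 can be illustrated as follows. The text does not fix a TF-IDF variant, so this sketch assumes term frequency normalised by document length and an unsmoothed natural-log inverse document frequency; the function names and the sparse dictionary representation are likewise assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Turn tokenised documents (lists of terms) into sparse TF-IDF dicts."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(a, b):
    """Euclidean distance for dense numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With this variant a term that occurs in every document gets weight zero, so two documents that share only such terms end up at cosine distance 1.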
As shown in Figs. 2 and 3, when the data volume is very large, the gap between the speedup achieved by the method of the invention on different numbers of nodes and the ideal speedup is very small; when the number of nodes does not exceed 3, there is almost no difference.
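The text does not define speedup. Assuming the usual definition, with $T_1$ the running time on one node and $T_n$ the running time on $n$ nodes (both symbols are introduced here for illustration), the quantities being compared would be:

```latex
S(n) = \frac{T_1}{T_n}, \qquad
S_{\text{ideal}}(n) = n, \qquad
E(n) = \frac{S(n)}{n}
```

Under this reading, "almost no difference" for up to 3 nodes would mean $S(n) \approx n$, that is, a parallel efficiency $E(n)$ close to 1.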

Claims (2)

1. A parallel k-means clustering method based on Hadoop, comprising the following steps:
(1) preprocessing the data;
(2) selecting the initial centers of the k classes;
(3) at the Mapper side of each node of the Hadoop platform, calculating the distance from each data object to all the initial centers;
(4) at the Mapper side, selecting the center at the minimum distance, and sending the local data to the Combiner side of each node of the Hadoop platform;
(5) at the Combiner side, grouping together the data objects belonging to the same center, calculating the sum of the data objects belonging to the same center, and sending the local data of each cluster to the Reducer side of each node of the Hadoop platform;
(6) at the Reducer side, aggregating the local data of all clusters and calculating the new center of every cluster;
(7) repeating steps (3) to (6); if all k cluster centers remain unchanged, the iteration ends and the clustering result is obtained; otherwise, the iteration continues.
2. The parallel k-means clustering method based on Hadoop according to claim 1, characterized in that the local data is a (key, value) pair, where key is the center and value is the subset of data objects.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310568611.1A CN103544328A (en) 2013-11-15 2013-11-15 Parallel k mean value clustering method based on Hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310568611.1A CN103544328A (en) 2013-11-15 2013-11-15 Parallel k mean value clustering method based on Hadoop

Publications (1)

Publication Number Publication Date
CN103544328A true CN103544328A (en) 2014-01-29

Family

ID=49967780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310568611.1A Pending CN103544328A (en) 2013-11-15 2013-11-15 Parallel k mean value clustering method based on Hadoop

Country Status (1)

Country Link
CN (1) CN103544328A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
CN105183855A (en) * 2015-09-08 2015-12-23 浪潮(北京)电子信息产业有限公司 Information classification method and system
CN105760478A (en) * 2016-02-15 2016-07-13 中山大学 Large-scale distributed data clustering method based on machine learning
WO2016123808A1 (en) * 2015-02-06 2016-08-11 华为技术有限公司 Data processing system, calculation node and data processing method
CN107392239A (en) * 2017-07-11 2017-11-24 南京邮电大学 A kind of K Means algorithm optimization methods based on Spark computation models

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
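向小军 et al.: "Parallelization of massive text classification on the Hadoop platform" (基于Hadoop平台的海量文本分类的并行化), 《计算机科学》 (Computer Science) *
张明辉: "Analysis and research of data mining algorithms based on Hadoop" (基于Hadoop的数据挖掘算法的分析与研究), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *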

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016123808A1 (en) * 2015-02-06 2016-08-11 华为技术有限公司 Data processing system, calculation node and data processing method
CN106062732A (en) * 2015-02-06 2016-10-26 华为技术有限公司 Data processing system, calculation node and data processing method
CN106062732B (en) * 2015-02-06 2019-03-01 华为技术有限公司 Data processing system, calculate node and the method for data processing
US10567494B2 (en) 2015-02-06 2020-02-18 Huawei Technologies Co., Ltd. Data processing system, computing node, and data processing method
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
CN105138661B (en) * 2015-09-02 2018-10-30 西北大学 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop
CN105183855A (en) * 2015-09-08 2015-12-23 浪潮(北京)电子信息产业有限公司 Information classification method and system
CN105760478A (en) * 2016-02-15 2016-07-13 中山大学 Large-scale distributed data clustering method based on machine learning
CN107392239A (en) * 2017-07-11 2017-11-24 南京邮电大学 A kind of K Means algorithm optimization methods based on Spark computation models

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140129