WO2018219163A1 - A MapReduce-based distributed clustering method for large-scale data - Google Patents

A MapReduce-based distributed clustering method for large-scale data

Info

Publication number
WO2018219163A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
cluster
clustering
center point
cluster center
Prior art date
Application number
PCT/CN2018/087567
Other languages
English (en)
French (fr)
Inventor
高天寒
孔雪
Original Assignee
东北大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学
Publication of WO2018219163A1 publication Critical patent/WO2018219163A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Definitions

  • The invention belongs to the field of parallel clustering technology, and in particular relates to a MapReduce-based distributed clustering method for large-scale data.
  • Cluster analysis is an important data processing technique and one of the key topics in machine learning and artificial intelligence; it is widely used in data mining and information retrieval.
  • Its main task is to partition a data set into multiple subsets so that data objects within a subset are highly similar while data objects in different subsets differ greatly.
  • MapReduce is a parallel programming model for large-scale data sets that is simple, easy to implement, and easy to extend.
  • Its core idea is divide and conquer:
  • the large-scale data set is split into small data sets that are processed by the child nodes managed by the master node, after which the intermediate results of the child nodes are merged into the final result.
  • scholars have carried out a series of researches on large-scale data clustering.
  • The K-Means method is one of the classical partition-based cluster analysis methods.
  • Its advantages are simple operation and fast convergence.
  • Its disadvantage is that the initial cluster centers are selected randomly, which can trap the clustering in a local optimum and degrade the final clustering result. Ensuring the accuracy of the initial cluster centers is therefore an important part of parallel clustering of large-scale data.
  • The current research hotspot is how parallel clustering methods select their initial cluster center points, mainly along two lines: combining K-Means with the Canopy method to determine the cluster centers, and computing the initial cluster centers from data density.
  • Canopy-Kmeans, which combines K-Means with the Canopy method, uses the characteristics of Canopy to compute object similarity and preprocess the data.
  • Its advantage is that it supplies initial cluster center points and avoids local optima; its disadvantage is that computing pairwise object similarity is time-consuming.
  • Density-based methods compute the density of all data and select the densest data as cluster center points, avoiding random selection and being more accurate; however, the traditional computation is also expensive and can overload individual nodes, reducing the overall efficiency of parallel clustering.
  • the present invention provides a large-scale data distributed clustering processing method based on MapReduce.
  • a large-scale data distributed clustering processing method based on MapReduce including:
  • Step 1 Sample the large-scale data on the principle of equal-scale sampling without repetition, and record the sampled data;
  • Step 2 Start a Hadoop distributed cluster environment, input sampling data to the MapReduce distributed parallel framework, and calculate local density and average density of the sampled data;
  • Step 3 The master node issues tasks to the child nodes using the average density Avg of the sampled data as the baseline; each child node sorts by local density and finds all sampled data whose local density is greater than the average density Avg as the candidate point set for the initial cluster center of each cluster,
  • which is fed back to the master node; the master node selects as initial cluster center points all candidate points in the set whose pairwise distance is greater than 2 times the set range;
  • Step 4 The master node distributes the initial cluster center points as tasks to the child nodes; each child node performs the parallel clustering task according to the initial cluster centers using the MapReduce distributed parallel framework, and for each cluster updates the
  • cluster center point to the mean of the data in that cluster;
  • Step 5 The child nodes apply the sum-of-squared-errors criterion function as the clustering criterion to decide whether to continue iterating: if the criterion function computed with the updated cluster center points has converged, the current cluster center points are the final cluster center points and are fed back to the master node, and step 6 is executed; otherwise return to step 4 and continue iteratively updating the cluster center points;
  • Step 6 The master node re-inputs the cluster center points and distributes the tasks, and each child node clusters the large-scale data according to the cluster center points.
  • The sampling obeys D = D_1 ∪ D_2 ∪ … ∪ D_N with D_i ∩ D_j = ∅ (i ≠ j), f_i ≈ f_j and N·f_i ≪ |D|, and e = f·n·δ, where
  • D represents the large-scale data set,
  • D_i and D_j represent two disjoint data sets,
  • i and j range from 1 to N,
  • the data sizes of D_i and D_j are recorded as f_i and f_j respectively,
  • N is the number of sampling rounds and e is the sample size,
  • f is the proportion of the sampled data in the large-scale data set,
  • with 0 ≤ f ≤ 0.1,
  • δ is the sampling probability,
  • with 0.5 ≤ δ ≤ 1.
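The equal-scale, non-repetitive sampling described above can be sketched as follows. This is a minimal single-machine Python sketch, not the patented MapReduce implementation; the function name, the parameter defaults, and the reading of e = f·n·δ with n as the full data-set size are assumptions made for illustration.

```python
import random

def equal_scale_sample(dataset, N=10, f=0.05, delta=0.8, seed=42):
    """Equal-scale, non-repetitive sampling: shuffle the data set, split it
    into N disjoint partitions of (roughly) equal size, then draw
    e = f * n * delta items from each partition without replacement."""
    assert 0 <= f <= 0.1 and 0.5 <= delta <= 1  # ranges stated in the patent
    rng = random.Random(seed)
    n = len(dataset)
    e = max(1, int(f * n * delta))  # sample size per round (assumed reading of e = f*n*delta)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    size = n // N                   # equal-scale partitions D_1 .. D_N
    partitions = [shuffled[i * size:(i + 1) * size] for i in range(N)]
    samples = []
    for part in partitions:         # disjoint partitions guarantee no repetition
        samples.extend(rng.sample(part, min(e, len(part))))
    return samples
```

Because the partitions are disjoint and `random.sample` draws without replacement, no datum can be sampled twice, which is the "no repetition" property the patent requires.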
  • the step 2 includes:
  • Step 2.1 Upload the sampled data to the Hadoop distributed cluster environment
  • Step 2.2 The master node in the Hadoop distributed cluster environment splits the incoming sampled data into multiple data blocks, and sends them to the child nodes for distributed computation of the local density of the sampled data;
  • Step 2.3 Each sub-node receives the task, and uses the MapReduce distributed parallel framework to perform local density calculation on the sampled data corresponding to each task, that is, calculates the number of neighbor data in the set range around the sampled data;
  • Step 2.4 Each child node feeds back the calculated local density to the master node, and the master node integrates and calculates the average density of the sampled data according to each local density, and outputs the average density and local density of the sampled data.
  • In the distance formula D_ij = sqrt( Σ_n (i_n − j_n)² ), i and j represent the i-th and the j-th data respectively, and n indicates that the sampled data has n attributes (for example, in the iris flower data set the attributes of each datum include sepal length and sepal width); i_n represents
  • the n-th attribute value of datum i, j_n represents the n-th attribute value of datum j, and D_ij represents the distance between the i-th and the j-th data.
  • In the local density formula ρ_i = Σ_{j=1, j≠i}^{m} λ(D_e − D_ij), ρ_i represents the local density of the i-th datum,
  • m represents the number of data,
  • D_e represents the cutoff radius around the i-th datum, i.e. the set range, and
  • λ is a coefficient: if the neighbour datum lies within the cutoff radius, i.e. within the set range, λ takes the value 1; otherwise it is 0.
  • In the average density formula Avg = (1/m) Σ_{i=1}^{m} ρ_i, Avg represents the average density of the m sampled data
  • and ρ_i represents the local density of the i-th sampled datum.
  • The invention provides a MapReduce-based distributed clustering method for large-scale data. By sampling the large-scale data on the principle of equal-scale sampling without repetition, computing the local density of the sampled data distributedly with the MapReduce parallel framework, and computing the average density of the data after integration, suitable and accurate initial cluster center points are selected to achieve parallel clustering. This reduces the number of clustering iterations and improves clustering accuracy and parallel clustering efficiency, making the method well suited to parallel cluster analysis of large-scale data and to classifying sample sets that have no classification and whose class labels are unknown; clustering can be applied to research fields such as image cluster analysis. K-Means is one of the classical partition-based cluster analysis algorithms; because it is simple to operate and converges quickly, parallelising the algorithm adapts it to the parallel cluster mode for large-scale data.
  • FIG. 1 is a block diagram of a Hadoop distributed cluster environment adopted in a specific embodiment of the present invention.
  • FIG. 2 is a flowchart of data processing based on a MapReduce parallel framework in an embodiment of the present invention
  • FIG. 3 is a flowchart of a large-scale data distributed clustering processing method based on MapReduce in a specific embodiment of the present invention
  • Figure 4 is a flow chart of step 2 in a specific embodiment of the present invention.
  • Figure 5 is a comparison of experimental results in a specific embodiment of the present invention, (a) comparison results of the accuracy of the three methods, and (b) comparison results of time consumption experiments of the three methods.
  • The Hadoop distributed cluster environment in this embodiment has three servers forming three nodes: one master node (Master) that issues commands and distributes tasks, and two child nodes (slaves) that receive the tasks distributed by the master node,
  • process them according to the master node's requirements, with all nodes connected by high-speed Ethernet.
  • The master node starts the entire cluster environment in response to the user's application request.
  • The child nodes and the master node form the main body of the parallel system of the Hadoop distributed cluster environment and are responsible for the processing and operation of the entire Hadoop distributed cluster, as shown in FIG. 2.
  • The data set used in this embodiment is the iris data set from the UCI Machine Learning Repository, also called the iris flower data set, a data set for multivariate analysis.
  • Experiments were run with data set sizes of 30, 60, 90, 120 and 150.
  • For each data set size, the clustering performance of the traditional parallel K-means method, the density-based parallel K-means method and the method of the invention was tested, comparing mainly accuracy and time consumption.
  • The comparison of the experimental results is shown in Figures 5(a) and (b).
  • the large-scale data distributed clustering processing method based on MapReduce includes:
  • Step 1 Sample the large-scale data on the principle of equal-scale sampling without repetition, and record the sampled data;
  • The sampling obeys D = D_1 ∪ D_2 ∪ … ∪ D_N with D_i ∩ D_j = ∅ (i ≠ j), f_i ≈ f_j and N·f_i ≪ |D|, and e = f·n·δ, where
  • D represents the large-scale data set,
  • D_i and D_j represent two disjoint data sets,
  • i and j range from 1 to N,
  • the data sizes of D_i and D_j are recorded as f_i and f_j respectively,
  • N is the number of sampling rounds and e is the sample size,
  • f is the proportion of the sampled data in the large-scale data set,
  • with 0 ≤ f ≤ 0.1,
  • δ is the sampling probability,
  • with 0.5 ≤ δ ≤ 1.
  • Step 2 Start a Hadoop distributed cluster environment, input sampling data to the MapReduce distributed parallel framework, and calculate local density and average density of the sampled data;
  • the step 2, as shown in FIG. 4, includes:
  • Step 2.1 In the CentOS system, start the Hadoop distributed cluster environment with the start-all.sh command, and upload the sampled data to the Hadoop distributed cluster environment;
  • Step 2.2 The primary node in the Hadoop distributed cluster environment divides the incoming sample data into multiple data blocks, and sends them to each child node for distributed processing to calculate the local density of the sampled data;
  • Step 2.3 Each sub-node receives the task, and uses the MapReduce distributed parallel framework to perform local density calculation on the sampled data corresponding to each task, that is, calculates the number of neighbor data in the set range around the sampled data;
  • In the distance formula D_ij = sqrt( Σ_n (i_n − j_n)² ), i and j represent the i-th and the j-th data respectively, and n indicates that the sampled data has n attributes (for example, in the iris flower data set the attributes of each datum include sepal length and sepal width); i_n represents
  • the n-th attribute value of datum i, j_n represents the n-th attribute value of datum j, and D_ij represents the distance between the i-th and the j-th data.
  • In the local density formula ρ_i = Σ_{j=1, j≠i}^{m} λ(D_e − D_ij), ρ_i represents the local density of the i-th datum,
  • m represents the number of data,
  • D_e represents the cutoff radius around the i-th datum, i.e. the set range, and
  • λ is a coefficient: if the neighbour datum lies within the cutoff radius, i.e. within the set range, λ takes the value 1; otherwise it is 0.
  • Step 2.4 Each child node feeds back the calculated local density to the master node, and the master node integrates and calculates the average density of the sampled data according to each local density, and outputs the average density and local density of the sampled data;
  • In the average density formula Avg = (1/m) Σ_{i=1}^{m} ρ_i, Avg represents the average density of the m sampled data
  • and ρ_i represents the local density of the i-th sampled datum.
  • Step 3 The master node issues tasks to the child nodes using the average density Avg of the sampled data as the baseline; each child node sorts by local density and finds all sampled data whose local density is greater than the average density Avg as the candidate point set for the initial cluster center of each cluster (a cluster represents one class of data), feeding it back to the master node; the master node selects as initial cluster center points all candidate points in the set whose pairwise distance is greater than 2 times the set range;
  • Selection of the initial cluster center points: first, the candidate point with the largest local density in the candidate set is chosen as the first initial cluster center point; next, a candidate point whose distance from the first initial cluster center point is greater than 2·De (where De is the cutoff radius) is chosen as the second initial cluster center point.
  • In the same way, the third initial cluster center point is a candidate point whose distance from both the first and the second initial cluster center points is greater than 2·De, and so on until the last candidate point in the candidate set has been considered, which ends the selection of initial cluster center points.
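The greedy selection procedure above can be sketched as follows. This is a minimal single-machine sketch under the assumption that candidates are visited in descending order of local density; the function name and argument layout are my own, not from the patent.

```python
import math

def select_initial_centers(candidates, rho, De):
    """Greedy selection of initial cluster centers from the candidate set
    (points whose local density exceeds the average density): pick the
    densest candidate first, then keep only candidates farther than 2*De
    from every center already chosen."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # visit candidates in descending order of local density rho
    order = sorted(range(len(candidates)), key=lambda i: -rho[i])
    centers = []
    for i in order:
        p = candidates[i]
        if all(dist(p, c) > 2 * De for c in centers):
            centers.append(p)
    return centers
```

The 2·De separation guarantees that no two initial centers fall inside the same density neighbourhood, which is what prevents the random-initialisation local optimum the patent criticises.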
  • Step 4 The master node distributes the initial cluster center points as tasks to the child nodes; each child node performs the parallel clustering task according to the initial cluster centers using the MapReduce distributed parallel framework, and for each cluster updates the
  • cluster center point according to e_i = (1/|C_i|) Σ_{x∈C_i} x,
  • where e_i is the new cluster center point of cluster C_i, i.e. the mean of the data in the cluster, and x is a datum in cluster C_i.
  • Step 5 The child nodes apply the sum-of-squared-errors criterion function as the clustering criterion to decide whether to continue iterating: if the criterion function computed with the updated cluster center points has converged, the current cluster center points are the final cluster center points and are fed back to the master node, and step 6 is executed; otherwise return to step 4 and continue iteratively updating the cluster center points.
  • The sum-of-squared-errors criterion function is calculated as M = Σ_{i=1}^{k} Σ_{n∈C_i} |n − e_i|², where
  • M is the sum of the squared deviations of all the data in the clusters,
  • n is a data object in the cluster C_i,
  • e_i is the mean of the data in the cluster C_i,
  • k is the number of cluster center points.
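Steps 4 and 5 together form the classic K-Means update loop. The sketch below is a single-machine stand-in for the MapReduce version; the convergence test (change in M below a tolerance) is an assumption, since the patent only says the criterion function must "converge", and the function name and tolerance default are my own.

```python
import math

def kmeans_iterate(data, centers, tol=1e-4, max_iter=100):
    """Assign each datum to its nearest center, update every center to the
    mean e_i of its cluster C_i, and stop when the sum-of-squared-errors
    criterion M stops changing by more than tol."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    prev_M = float("inf")
    for _ in range(max_iter):
        # assignment step: nearest center wins
        clusters = [[] for _ in centers]
        for p in data:
            k = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[k].append(p)
        # update step: e_i = mean of the data in cluster C_i
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else ctr
                   for cl, ctr in zip(clusters, centers)]
        # criterion function M = sum over clusters of squared deviations
        M = sum(dist2(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)
        if abs(prev_M - M) < tol:
            break
        prev_M = M
    return centers, M
```

With two well-separated pairs of points and one initial center near each pair, the loop settles on the two pair means after a single update and M equals the four within-pair squared deviations.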
  • Step 6 The master node re-inputs the cluster center points and distributes the tasks, and each child node clusters the large-scale data according to the cluster center points.

Abstract

The present invention provides a MapReduce-based distributed clustering method for large-scale data, comprising: sampling the large-scale data on the principle of equal-scale sampling without repetition; feeding the sampled data into the MapReduce distributed parallel framework and computing its local density and average density; finding all sampled data whose local density exceeds the average density as the candidate point set for the initial cluster center of each cluster and feeding it back to the master node, which selects as initial cluster center points all candidate points whose pairwise distance exceeds twice the set range; running the parallel clustering task on the MapReduce distributed parallel framework and updating each cluster center point by averaging over the data in the cluster; having the child nodes apply the sum-of-squared-errors criterion function to decide whether to continue iterating; and having each child node cluster the large-scale data according to the cluster center points. The invention achieves parallel clustering, reduces the number of clustering iterations, and improves clustering accuracy and parallel clustering efficiency.

Description

A MapReduce-based distributed clustering method for large-scale data
Technical Field
The invention belongs to the field of parallel clustering technology, and in particular relates to a MapReduce-based distributed clustering method for large-scale data.
Background Art
With the rapid development of information technology, data volumes keep growing; effectively mining and analysing large-scale data sets with parallel mechanisms can drive the development and progress of Internet technology. Cluster analysis is an important data processing technique and one of the key topics in machine learning and artificial intelligence, widely used in data mining, information retrieval and related research. Its main task is to partition a data set into multiple subsets so that data objects within a subset are highly similar while data objects in different subsets differ greatly. As data volumes grow, traditional single-machine clustering methods can no longer process large-scale data in acceptable time; they are inefficient and their clustering results are unsatisfactory. Meanwhile big-data technology has matured, and more and more people have begun to study Hadoop MapReduce. Building a parallel cluster mode that exploits the MapReduce parallel framework is therefore an important research direction for solving these problems.
MapReduce is a parallel programming model for large-scale data sets; it is simple, easy to implement and easy to extend. Its core idea is divide and conquer: a large-scale data set is split into many small data sets that are processed jointly by the child nodes managed by the master node, after which the intermediate results of the child nodes are merged into the final result. In recent years, scholars have carried out a series of studies on large-scale data clustering. Among them, K-Means is one of the classical partition-based cluster analysis methods; its advantages are simple operation and fast convergence, while its disadvantage is that the initial cluster centers are chosen randomly, which easily traps the clustering in a local optimum and degrades the final result. Ensuring the accuracy of the initial cluster centers is therefore a key step in parallel clustering of large-scale data.
Current research focuses on how parallel clustering methods select their initial cluster center points, mainly along two lines: combining K-Means with the Canopy method to determine the cluster centers, and determining the initial cluster centers from data density. Canopy-Kmeans, which combines K-Means with the Canopy method, uses the characteristics of Canopy to compute object similarity and preprocess the data; its advantage is that it supplies initial cluster center points and avoids local optima, but its disadvantage is that computing pairwise object similarity is time-consuming. Density-based methods compute the density of all data and select the densest data as cluster center points, avoiding the problem of random selection and being more accurate; however, the traditional computation is also expensive and easily overloads individual nodes, reducing the overall efficiency of parallel clustering.
Summary of the Invention
In view of the problems in the prior art, the present invention provides a MapReduce-based distributed clustering method for large-scale data.
The technical solution of the invention is as follows:
A MapReduce-based distributed clustering method for large-scale data, comprising:
Step 1: sample the large-scale data on the principle of equal-scale sampling without repetition, and record the sampled data;
Step 2: start the Hadoop distributed cluster environment, feed the sampled data into the MapReduce distributed parallel framework, and compute the local density and average density of the sampled data;
Step 3: the master node issues tasks to the child nodes using the average density Avg of the sampled data as the baseline; each child node sorts by local density and finds all sampled data whose local density is greater than the average density Avg as the candidate point set for the initial cluster center of each cluster, feeding it back to the master node; the master node selects as initial cluster center points all candidate points in the set whose pairwise distance is greater than 2 times the set range;
Step 4: the master node distributes the initial cluster center points as tasks to the child nodes; each child node performs the parallel clustering task according to the initial cluster centers using the MapReduce distributed parallel framework, and for each cluster updates the cluster center point by averaging over the data in the cluster;
Step 5: the child nodes apply the sum-of-squared-errors criterion function as the clustering criterion to decide whether to continue iterating: if the criterion function computed with the updated cluster center points has converged, the current cluster center points are the final cluster center points and are fed back to the master node, and step 6 is executed; otherwise return to step 4 and continue iteratively updating the cluster center points;
Step 6: the master node re-inputs the cluster center points and distributes the tasks, and each child node clusters the large-scale data according to the cluster center points.
The equal-scale, non-repetitive sampling uses the following formulas:
D = D_1 ∪ D_2 ∪ … ∪ D_N, D_i ∩ D_j = ∅ (i ≠ j)
f_i ≈ f_j and N·f_i ≪ |D|
e = f·n·δ
where D represents the large-scale data set; D_i and D_j represent two disjoint data sets, with i and j ranging from 1 to N; the data sizes of D_i and D_j are recorded as f_i and f_j respectively; N is the number of sampling rounds and e is the sample size; f is the proportion of the sampled data in the large-scale data set, with 0 ≤ f ≤ 0.1; and δ is the sampling probability, with 0.5 ≤ δ ≤ 1.
Step 2 comprises:
Step 2.1: upload the sampled data to the Hadoop distributed cluster environment;
Step 2.2: the master node in the Hadoop distributed cluster environment splits the incoming sampled data into multiple data blocks and sends them to the child nodes for distributed computation of the local density of the sampled data;
Step 2.3: each child node receives its task and uses the MapReduce distributed parallel framework to compute the local density of the sampled data in the task, i.e. the number of neighbouring data within the set range around each sampled datum;
Step 2.4: each child node feeds the computed local densities back to the master node, which integrates them, computes the average density of the sampled data from the local densities, and outputs the average density and the local densities of the sampled data.
The local density is computed as:
D_ij = sqrt( Σ_n (i_n − j_n)² )
ρ_i = Σ_{j=1, j≠i}^{m} λ(D_e − D_ij)
where i and j represent the i-th and the j-th data respectively; n indicates that the sampled data has n attributes (for example, in the iris flower data set the attributes of each datum include sepal length and sepal width); i_n and j_n represent the n-th attribute values of data i and j; D_ij represents the distance between the i-th and the j-th data; ρ_i represents the local density of the i-th datum; m represents the number of data; D_e represents the cutoff radius around the i-th datum, i.e. the set range; and λ is a coefficient that takes the value 1 if the neighbour datum lies within the cutoff radius, i.e. within the set range, and 0 otherwise.
The average density is computed as:
Avg = (1/m) Σ_{i=1}^{m} ρ_i
where Avg represents the average density of the m sampled data and ρ_i represents the local density of the i-th sampled datum.
Beneficial effects:
The invention provides a MapReduce-based distributed clustering method for large-scale data. By sampling the large-scale data on the principle of equal-scale sampling without repetition, computing the local density of the sampled data distributedly with the MapReduce distributed parallel framework, and computing the average density of the data after integration, suitable and accurate initial cluster center points are selected to achieve parallel clustering. This reduces the number of clustering iterations and improves clustering accuracy and parallel clustering efficiency, making the method well suited to parallel cluster analysis of large-scale data and to classifying sample sets that have no classification and whose class labels are unknown; clustering can be applied to research fields such as image cluster analysis. K-Means is one of the classical partition-based cluster analysis algorithms; because it is simple to operate and converges quickly, parallelising the algorithm adapts it to the parallel cluster mode and hence to large-scale data.
Brief Description of the Drawings
Figure 1 is a block diagram of the Hadoop distributed cluster environment used in an embodiment of the invention;
Figure 2 is a flowchart of data processing based on the MapReduce parallel framework in an embodiment of the invention;
Figure 3 is a flowchart of the MapReduce-based distributed clustering method for large-scale data in an embodiment of the invention;
Figure 4 is a flowchart of step 2 in an embodiment of the invention;
Figure 5 compares experimental results in an embodiment of the invention: (a) accuracy of the three methods, (b) time consumption of the three methods.
Detailed Description
The embodiments of the invention are described in detail below with reference to the drawings.
As shown in Figure 1, the Hadoop distributed cluster environment in this embodiment has three servers forming three nodes: one master node (Master) that issues commands and distributes tasks, and two child nodes (slaves) that receive the tasks distributed by the master node and process them according to the master node's requirements; all nodes are connected by high-speed Ethernet. The master node starts the entire cluster environment in response to the user's application request; the child nodes and the master node form the main body of the parallel system of the Hadoop distributed cluster environment and are responsible for the processing and operation of the entire Hadoop distributed cluster. As shown in Figure 2, in this embodiment: 1) the data to be processed is received according to the user's requirements, the input file is split into data blocks, and the blocks are distributed to the child nodes as key-value pairs <key1,value1>; 2) each child node processes its data block with the map function and sends the resulting new key-value pairs <key2,value2> to the node's merge side for intermediate merging into <key2,list<value2>>; 3) the child node sends the merged data to the reduce side for processing by the reduce function, which integrates the results of all nodes and outputs the final result <key3,value3>.
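The map/merge/reduce key-value flow described above can be illustrated with a toy single-process sketch, using the local density computation as the job. This is an illustration of the data flow only, not Hadoop code; the function names are my own, and the simplification that every mapper can see the full point list is an assumption (a real job would broadcast the sampled points or join them in).

```python
import math
from collections import defaultdict

def map_phase(block, De, all_points):
    """Map: for each point index in this mapper's block, emit <i, 1> once
    per neighbour within the cutoff radius De (simplified: each mapper is
    assumed to see the full point list for distance checks)."""
    out = []
    for i in block:
        for j in range(len(all_points)):
            if j == i:
                continue
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(all_points[i], all_points[j])))
            if d <= De:
                out.append((i, 1))   # <key2, value2>
    return out

def shuffle(pairs):
    """Merge step: group values by key into <key2, list<value2>>."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts per point to get its local density <key3, value3>."""
    return {k: sum(vs) for k, vs in grouped.items()}
```

Splitting the indices into per-node blocks, mapping each block, shuffling, and reducing reproduces exactly the local densities a single machine would compute, which is the point of the divide-and-conquer flow.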
The data set used in this embodiment is the iris data set from the UCI Machine Learning Repository, also called the iris flower data set, a data set for multivariate analysis. It contains 150 samples in 3 classes of 50 samples each, and every sample has 4 attributes. Using data set sizes of 30, 60, 90, 120 and 150, the clustering performance of the traditional parallel K-means method, the density-based parallel K-means method and the method of the invention was tested, comparing mainly accuracy and time consumption. The experimental comparison is shown in Figures 5(a) and (b).
The MapReduce-based distributed clustering method for large-scale data, as shown in Figure 3, comprises:
Step 1: sample the large-scale data on the principle of equal-scale sampling without repetition, and record the sampled data;
The equal-scale, non-repetitive sampling rule is:
D = D_1 ∪ D_2 ∪ … ∪ D_N, D_i ∩ D_j = ∅ (i ≠ j)
f_i ≈ f_j and N·f_i ≪ |D|
e = f·n·δ
where D represents the large-scale data set; D_i and D_j represent two disjoint data sets, with i and j ranging from 1 to N; the data sizes of D_i and D_j are recorded as f_i and f_j respectively; N is the number of sampling rounds and e is the sample size; f is the proportion of the sampled data in the large-scale data set, with 0 ≤ f ≤ 0.1; and δ is the sampling probability, with 0.5 ≤ δ ≤ 1.
Step 2: start the Hadoop distributed cluster environment, feed the sampled data into the MapReduce distributed parallel framework, and compute the local density and average density of the sampled data;
Step 2, as shown in Figure 4, comprises:
Step 2.1: in the CentOS system, start the Hadoop distributed cluster environment with the start-all.sh command, and upload the sampled data to the Hadoop distributed cluster environment;
Step 2.2: the master node in the Hadoop distributed cluster environment splits the incoming sampled data into multiple data blocks and sends them to the child nodes for distributed computation of the local density of the sampled data;
Step 2.3: each child node receives its task and uses the MapReduce distributed parallel framework to compute the local density of the sampled data in the task, i.e. the number of neighbouring data within the set range around each sampled datum;
The local density is computed as:
D_ij = sqrt( Σ_n (i_n − j_n)² )
ρ_i = Σ_{j=1, j≠i}^{m} λ(D_e − D_ij)
where i and j represent the i-th and the j-th data respectively; n indicates that the sampled data has n attributes (for example, in the iris flower data set the attributes of each datum include sepal length and sepal width); i_n and j_n represent the n-th attribute values of data i and j; D_ij represents the distance between the i-th and the j-th data; ρ_i represents the local density of the i-th datum; m represents the number of data; D_e represents the cutoff radius around the i-th datum, i.e. the set range; and λ is a coefficient that takes the value 1 if the neighbour datum lies within the cutoff radius, i.e. within the set range, and 0 otherwise.
Step 2.4: each child node feeds the computed local densities back to the master node, which integrates them, computes the average density of the sampled data from the local densities, and outputs the average density and the local densities of the sampled data;
The average density is computed as:
Avg = (1/m) Σ_{i=1}^{m} ρ_i
where Avg represents the average density of the m sampled data and ρ_i represents the local density of the i-th sampled datum.
Step 3: the master node issues tasks to the child nodes using the average density Avg of the sampled data as the baseline; each child node sorts by local density and finds all sampled data whose local density is greater than the average density Avg as the candidate point set for the initial cluster center of each cluster (a cluster represents one class of data), feeding it back to the master node; the master node selects as initial cluster center points all candidate points in the set whose pairwise distance is greater than 2 times the set range;
Selection of the initial cluster center points: first, the candidate point with the largest local density in the candidate set is chosen as the first initial cluster center point; next, a candidate point whose distance from the first initial cluster center point is greater than 2·De (where De is the cutoff radius) is chosen as the second initial cluster center point; in the same way, the third initial cluster center point is a candidate point whose distance from both the first and the second initial cluster center points is greater than 2·De, and so on until the last candidate point in the candidate set has been considered, which ends the selection of initial cluster center points.
Step 4: the master node distributes the initial cluster center points as tasks to the child nodes; each child node performs the parallel clustering task according to the initial cluster centers using the MapReduce distributed parallel framework, and for each cluster updates the cluster center point by averaging over the data in the cluster;
The new cluster center point is computed as:
e_i = (1/|C_i|) Σ_{x∈C_i} x
where e_i is the new cluster center point of cluster C_i, i.e. the mean of the data in the cluster, and x is a datum in cluster C_i.
Step 5: the child nodes apply the sum-of-squared-errors criterion function as the clustering criterion to decide whether to continue iterating: if the criterion function computed with the updated cluster center points has converged, the current cluster center points are the final cluster center points and are fed back to the master node, and step 6 is executed; otherwise return to step 4 and continue iteratively updating the cluster center points.
The sum-of-squared-errors criterion function is calculated as:
M = Σ_{i=1}^{k} Σ_{n∈C_i} |n − e_i|²
where M is the sum of the squared deviations of all data in the clusters, n is a data object in cluster C_i, e_i is the mean of the data in cluster C_i, and k is the number of cluster center points.
Step 6: the master node re-inputs the cluster center points and distributes the tasks, and each child node clusters the large-scale data according to the cluster center points.

Claims (5)

  1. A MapReduce-based distributed clustering method for large-scale data, characterised by comprising:
    Step 1: sample the large-scale data on the principle of equal-scale sampling without repetition, and record the sampled data;
    Step 2: start the Hadoop distributed cluster environment, feed the sampled data into the MapReduce distributed parallel framework, and compute the local density and average density of the sampled data;
    Step 3: the master node issues tasks to the child nodes using the average density Avg of the sampled data as the baseline; each child node sorts by local density and finds all sampled data whose local density is greater than the average density Avg as the candidate point set for the initial cluster center of each cluster, feeding it back to the master node; the master node selects as initial cluster center points all candidate points in the set whose pairwise distance is greater than 2 times the set range;
    Step 4: the master node distributes the initial cluster center points as tasks to the child nodes; each child node performs the parallel clustering task according to the initial cluster centers using the MapReduce distributed parallel framework, and for each cluster updates the cluster center point by averaging over the data in the cluster;
    Step 5: the child nodes apply the sum-of-squared-errors criterion function as the clustering criterion to decide whether to continue iterating: if the criterion function computed with the updated cluster center points has converged, the current cluster center points are the final cluster center points and are fed back to the master node, and step 6 is executed; otherwise return to step 4 and continue iteratively updating the cluster center points;
    Step 6: the master node re-inputs the cluster center points and distributes the tasks, and each child node clusters the large-scale data according to the cluster center points.
  2. The method according to claim 1, characterised in that the equal-scale, non-repetitive sampling uses the following formulas:
    D = D_1 ∪ D_2 ∪ … ∪ D_N, D_i ∩ D_j = ∅ (i ≠ j)
    f_i ≈ f_j and N·f_i ≪ |D|
    e = f·n·δ
    where D represents the large-scale data set; D_i and D_j represent two disjoint data sets, with i and j ranging from 1 to N; the data sizes of D_i and D_j are recorded as f_i and f_j respectively; N is the number of sampling rounds and e is the sample size; f is the proportion of the sampled data in the large-scale data set, with 0 ≤ f ≤ 0.1; and δ is the sampling probability, with 0.5 ≤ δ ≤ 1.
  3. The method according to claim 1, characterised in that step 2 comprises:
    Step 2.1: upload the sampled data to the Hadoop distributed cluster environment;
    Step 2.2: the master node in the Hadoop distributed cluster environment splits the incoming sampled data into multiple data blocks and sends them to the child nodes for distributed computation of the local density of the sampled data;
    Step 2.3: each child node receives its task and uses the MapReduce distributed parallel framework to compute the local density of the sampled data in the task, i.e. the number of neighbouring data within the set range around each sampled datum;
    Step 2.4: each child node feeds the computed local densities back to the master node, which integrates them, computes the average density of the sampled data from the local densities, and outputs the average density and the local densities of the sampled data.
  4. The method according to claim 1 or 3, characterised in that the local density is computed as:
    D_ij = sqrt( Σ_n (i_n − j_n)² )
    ρ_i = Σ_{j=1, j≠i}^{m} λ(D_e − D_ij)
    where i and j represent the i-th and the j-th data respectively; n indicates that the sampled data has n attributes (for example, in the iris flower data set the attributes of each datum include sepal length and sepal width); i_n and j_n represent the n-th attribute values of data i and j; D_ij represents the distance between the i-th and the j-th data; ρ_i represents the local density of the i-th datum; m represents the number of data; D_e represents the cutoff radius around the i-th datum, i.e. the set range; and λ is a coefficient that takes the value 1 if the neighbour datum lies within the cutoff radius, i.e. within the set range, and 0 otherwise.
  5. The method according to claim 1 or 3, characterised in that the average density is computed as:
    Avg = (1/m) Σ_{i=1}^{m} ρ_i
    where Avg represents the average density of the m sampled data and ρ_i represents the local density of the i-th sampled datum.
PCT/CN2018/087567 2017-06-02 2018-05-18 A MapReduce-based distributed clustering method for large-scale data WO2018219163A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710412014.8A CN107291847B (zh) 2017-06-02 2017-06-02 A MapReduce-based distributed clustering method for large-scale data
CN201710412014.8 2017-06-02

Publications (1)

Publication Number Publication Date
WO2018219163A1 true WO2018219163A1 (zh) 2018-12-06

Family

ID=60094757

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/087567 WO2018219163A1 (zh) 2017-06-02 2018-05-18 一种基于MapReduce的大规模数据分布式聚类处理方法

Country Status (2)

Country Link
CN (1) CN107291847B (zh)
WO (1) WO2018219163A1 (zh)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291847B (zh) * 2017-06-02 2019-06-25 东北大学 一种基于MapReduce的大规模数据分布式聚类处理方法
CN108122012B (zh) * 2017-12-28 2020-11-24 百度在线网络技术(北京)有限公司 常驻点中心点的确定方法、装置、设备及存储介质
CN110233798B (zh) * 2018-03-05 2021-02-26 华为技术有限公司 数据处理方法、装置及系统
CN109033084B (zh) * 2018-07-26 2022-10-28 国信优易数据股份有限公司 一种语义层次树构建方法以及装置
CN109302406B (zh) * 2018-10-31 2021-06-25 法信公证云(厦门)科技有限公司 一种分布式网页取证的方法及系统
CN109242048B (zh) * 2018-11-07 2022-04-08 电子科技大学 基于时间序列的视觉目标分布式聚类方法
CN109410588B (zh) * 2018-12-20 2022-03-15 湖南晖龙集团股份有限公司 一种基于交通大数据的交通事故演化分析方法
CN109885685A (zh) * 2019-02-01 2019-06-14 珠海世纪鼎利科技股份有限公司 情报数据处理的方法、装置、设备及存储介质
CN110069467A (zh) * 2019-04-16 2019-07-30 沈阳工业大学 基于皮尔逊系数与MapReduce并行计算的电网尖峰负荷聚类提取法
CN110222248A (zh) * 2019-05-28 2019-09-10 长江大学 一种大数据聚类方法及装置
CN110276449B (zh) * 2019-06-24 2021-06-04 深圳前海微众银行股份有限公司 一种基于无监督学习的数据处理方法及装置
CN111079653B (zh) * 2019-12-18 2024-03-22 中国工商银行股份有限公司 数据库自动分库方法及装置
CN111401412B (zh) * 2020-02-29 2022-06-14 同济大学 一种基于平均共识算法的物联网环境下分布式软聚类方法
CN111597230A (zh) * 2020-05-15 2020-08-28 江西理工大学 基于MapReduce的并行密度聚类挖掘方法
CN113515512A (zh) * 2021-06-22 2021-10-19 国网辽宁省电力有限公司鞍山供电公司 一种工业互联网平台数据的质量治理及提升方法
CN115952426B (zh) * 2023-03-10 2023-06-06 中南大学 基于随机采样的分布式噪音数据聚类方法及用户分类方法
CN116595102B (zh) * 2023-07-17 2023-10-17 法诺信息产业有限公司 一种改进聚类算法的大数据管理方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120182891A1 (en) * 2011-01-19 2012-07-19 Youngseok Lee Packet analysis system and method using hadoop based parallel computation
CN103838863A (zh) * 2014-03-14 2014-06-04 内蒙古科技大学 一种基于云计算平台的大数据聚类算法
CN104615638A (zh) * 2014-11-25 2015-05-13 浙江银江研究院有限公司 一种面向大数据的分布式密度聚类方法
CN107291847A (zh) * 2017-06-02 2017-10-24 东北大学 一种基于MapReduce的大规模数据分布式聚类处理方法


Also Published As

Publication number Publication date
CN107291847A (zh) 2017-10-24
CN107291847B (zh) 2019-06-25

Similar Documents

Publication Publication Date Title
WO2018219163A1 (zh) 2018-12-06 A MapReduce-based distributed clustering method for large-scale data
Yuan et al. An improved network traffic classification algorithm based on Hadoop decision tree
CN102222092A (zh) 一种MapReduce平台上的海量高维数据聚类方法
Xu et al. Distributed maximal clique computation
CN103793438B (zh) 一种基于MapReduce的并行聚类方法
CN104834709B (zh) 一种基于负载均衡的并行余弦模式挖掘方法
Zhang et al. Improvement of K-means algorithm based on density
Zhang et al. An improved parallel K-means algorithm based on MapReduce
Orlandi et al. Entropy to mitigate non-IID data problem on Federated Learning for the Edge Intelligence environment
Ma et al. Like attracts like: Personalized federated learning in decentralized edge computing
Yang et al. Parallel implementation of ant-based clustering algorithm based on hadoop
CN105335499A (zh) 一种基于分布-收敛模型的文献聚类方法
Barger et al. k-means for streaming and distributed big sparse data
Triguero et al. A combined mapreduce-windowing two-level parallel scheme for evolutionary prototype generation
Bawane et al. Clustering algorithms in MapReduce: a review
Li et al. GAP: Genetic algorithm based large-scale graph partition in heterogeneous cluster
Zheng et al. Large graph sampling algorithm for frequent subgraph mining
Li et al. Parallel k-dominant skyline queries over uncertain data streams with capability index
Cui et al. The learning stimulated sensing-transmission coordination via age of updates in distributed uav swarm
Wang et al. An adaptively disperse centroids k-means algorithm based on mapreduce model
Łukasik et al. Efficient astronomical data condensation using approximate nearest neighbors
She et al. The pruning algorithm of parallel shared decision tree based on Hadoop
Ling et al. Optimization of the distributed K-means clustering algorithm based on set pair analysis
Wang et al. Research on Clustream Algorithm Based on Spark
Yushui et al. K-means clustering algorithm for large-scale Chinese commodity information web based on Hadoop

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18808682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18808682

Country of ref document: EP

Kind code of ref document: A1