CN106484818A - Hierarchical clustering method based on Hadoop and HBase - Google Patents
- Publication number
- CN106484818A CN106484818A CN201610851970.1A CN201610851970A CN106484818A CN 106484818 A CN106484818 A CN 106484818A CN 201610851970 A CN201610851970 A CN 201610851970A CN 106484818 A CN106484818 A CN 106484818A
- Authority
- CN
- China
- Prior art keywords
- hbase
- distance
- cluster
- clusters
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a hierarchical clustering method based on Hadoop and HBase. The method uses Hadoop to compute the distance matrix, converts the result into HFile files, and imports them into HBase with the Bulk Load method. HBase stores the distance matrix, mainly in two tables: one sorted by cluster-ID pair and the other sorted by inter-cluster distance, so that in each iteration the two closest clusters can be conveniently retrieved and merged. Finally, a multi-threaded algorithm combined with a cache processes the distance matrix in HBase, implementing hierarchical clustering with several tunable parameters; the algorithm simultaneously supports the single-linkage, complete-linkage, and average-linkage methods. The proposed scheme exploits Hadoop's parallel computing capability and HBase's massive-data storage capability, thereby improving the big-data processing capability and scalability of the hierarchical clustering algorithm.
Description
Technical Field
The invention relates to the technical fields of hierarchical clustering algorithms, Hadoop, and HBase, and in particular to the design and implementation of a hierarchical clustering method based on Hadoop and HBase.
Background Art
As a simple and widely accepted clustering algorithm, hierarchical clustering has been applied in many fields, such as information retrieval and bioinformatics. Its advantage is that it presents the clustering result in considerable detail: it organizes the relationships between clusters into a dendrogram, so users can see exactly how the clusters were merged together, which many other clustering algorithms do not provide. Moreover, unlike k-means and similar algorithms, hierarchical clustering does not require the user to specify the number of clusters in advance. Although hierarchical clustering has many advantages and is widely accepted and used, the performance of single-machine implementations can no longer keep up with rapidly growing data volumes; the algorithm's high complexity and inherent data dependencies make it hard to execute efficiently on large data sets. Yet more useful information can usually be extracted only from more complete data, and data-set size has become a very important factor in machine learning. This creates an urgent need for a hierarchical clustering algorithm that can run on large data sets.
Hadoop is a software framework for the distributed processing of massive data that provides a reliable, efficient, and scalable way to process it. HBase, another important member of the Hadoop ecosystem, is a non-relational distributed database. It provides a highly reliable, high-performance, scalable, column-oriented storage system suitable for storing unstructured data. As very important big-data technologies, both Hadoop and HBase are widely used across many big-data fields.
Hierarchical clustering depends on a distance matrix with O(n²) space complexity, so it can neither handle large data sets well on a single machine nor scale well. Exploiting Hadoop's parallel computing capability and HBase's high-performance massive-data storage can provide an effective solution. However, no hierarchical clustering algorithm on the Hadoop and HBase platforms currently supports the single-linkage, complete-linkage, and average-linkage methods at the same time.
Summary of the Invention
The object of the present invention is to overcome the above shortcomings of the prior art by providing a hierarchical clustering method based on Hadoop and HBase that simultaneously supports the single-linkage, complete-linkage, and average-linkage methods. The specific technical scheme is as follows.
A hierarchical clustering method based on Hadoop and HBase, which uses Hadoop to parallelize the computation of the distance matrix and uses HBase to store it. The table design uses a RowKey scheme that takes full advantage of HBase's sorted storage, and multi-threading and caching are combined to process the distance matrix in HBase, yielding a scalable hierarchical clustering method applicable to big data.
Further, the parallelized computation of the distance matrix is implemented with Hadoop and consists mainly of two MapReduce jobs: the first computes the distances and stores the results in an intermediate file, and the second converts the intermediate results into HFile format. Finally, the Bulk Load method imports the results into HBase.
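The distance computation in the first job can be sketched in plain Python standing in for the Hadoop mapper (a simplified, single-process sketch: the record layout mirrors the (distance, ID1, ID2) intermediate format described later, and the Euclidean metric is assumed):

```python
import math

def euclidean(p, q):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def map_distances(points):
    """Emit one (distance, id1, id2) record per unordered pair.

    `points` maps cluster ID -> feature vector.  Mirroring the mapper
    described in the text, each ID is paired only with strictly larger
    IDs, so every pair is computed exactly once.
    """
    records = []
    for i in sorted(points):
        for j in sorted(points):
            if j > i:
                records.append((euclidean(points[i], points[j]), i, j))
    return records
```

In the real pipeline these records would go to an intermediate file and then be converted to HFiles; here they are simply returned as a list.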
Further, there are two main tables in HBase. In one table, the RowKey is a cluster-ID pair, with the cluster IDs padded with leading zeros to a uniform length so that the records in the table sort by cluster ID; the value is the distance between the clusters. This table gives fast access to the distances associated with a given cluster ID. In the other table, the RowKey is the inter-cluster distance, again padded with leading zeros so that the records sort by distance in ascending order; the value is the cluster-ID pair. Reading the first row of this table quickly yields the two closest clusters.
In the initial stage of the clustering method, the two tables are pre-partitioned in advance by cluster ID and by distance respectively, and the partitions are spread across the nodes of the HBase cluster.
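The pre-partitioning step can be illustrated by computing evenly spaced, zero-padded boundary keys for a table's ID space (a sketch only; the key width and region count below are illustrative values, not figures from the patent):

```python
def split_keys(max_id, num_regions, width=6):
    """Evenly spaced, zero-padded boundary keys for pre-splitting.

    Returns num_regions - 1 boundary row keys covering IDs 0..max_id,
    analogous to pre-partitioning the cluster-ID table so that each
    region server takes a share of the key space.
    """
    step = max_id // num_regions
    return [str(i * step).zfill(width) for i in range(1, num_regions)]
```

Because the keys are zero-padded, their lexicographic order matches numeric order, which is what lets HBase route each key range to its region.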
Further, during clustering, the merge path of each cluster and the number of atomic clusters it contains are recorded in memory. With the single-linkage or complete-linkage algorithm, computing the distance between a newly merged cluster and an existing cluster only requires retrieving, from HBase or the cache, the distances between the two clusters that formed the new cluster and the remaining clusters. With the average-linkage algorithm, in addition to those distances, the number of atomic clusters contained in each cluster must be obtained from memory so that the average can be computed.
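The update rules described above can be sketched as a small helper (`updated_distance` is a hypothetical name; the formulas are the standard single-, complete-, and average-linkage update rules for merging clusters A and B and measuring against another cluster C):

```python
def updated_distance(method, d_ac, d_bc, n_a=1, n_b=1):
    """Distance between the merged cluster (A ∪ B) and another cluster C.

    Only d(A,C), d(B,C) and, for average linkage, the atomic-cluster
    counts n_a, n_b are needed -- exactly the information the method
    keeps in HBase/cache and in memory.
    """
    if method == "single":
        return min(d_ac, d_bc)
    if method == "complete":
        return max(d_ac, d_bc)
    if method == "average":
        # Weighted mean over all point pairs between (A ∪ B) and C.
        return (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
    raise ValueError(f"unknown linkage method: {method}")
```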
Further, the cache-assisted multi-threaded algorithm obtains the two closest clusters from HBase and merges them into a new cluster, fetching in parallel, from the cache or HBase, the distances between those two clusters and all other clusters. Meanwhile, a deletion thread is started to remove stale data from HBase, a computation thread is started to compute, ahead of time, the distances between the new cluster and the clusters whose distance information has already been fetched, and the new distances are written back to the cache or HBase in parallel. The cache is used here to reduce the algorithm's network I/O.
Further, Hadoop is used to compute the algorithm's distance matrix and HBase to store it, and the algorithm is designed and implemented with multi-threading and caching, thereby improving the scalability of hierarchical clustering and its capacity to process big data.
Compared with the prior art, the present invention has the following advantages and technical effects:
The invention implements a parallelized distance-matrix computation algorithm on Hadoop, converts the results into HFile files, and imports them into HBase with the Bulk Load method. HBase stores the distance matrix; the tables use a distinctive RowKey design that exploits HBase's sorted storage, allowing the algorithm to quickly obtain distance information and the two closest clusters. The algorithm combines multi-threading and caching and reserves several tunable parameters. Through Hadoop's parallel computing capability and HBase's massive-data storage capability, the scalability and big-data processing capability of hierarchical clustering are improved.
Brief Description of the Drawings
Figure 1 is an architecture diagram of the distance-matrix computation algorithm in the example.
Figure 2 is a schematic diagram of the table design in the example.
Figure 3 is a schematic diagram of a hierarchical tree produced by the hierarchical clustering algorithm in the example.
Figure 4 is a schematic diagram of the thread execution sequence.
Figure 5 is an explanatory table of the algorithm parameters in the example.
Detailed Description
To make the technical scheme and advantages of the invention clearer, a further detailed description follows in conjunction with the drawings, but the implementation and protection of the invention are not limited to it. It should be noted that any symbol or process not specifically described in detail below can be understood or implemented by those skilled in the art with reference to the prior art.
1. Parallelized computation of the distance matrix
The parallelized distance-matrix computation algorithm aims to speed up computing the distance matrix and to import it quickly into HBase. During clustering, the hierarchical algorithm depends on a distance matrix with O(n²) space complexity; in this method, a Hadoop-based parallel algorithm is designed and implemented for this computation, as shown in Figure 1. The algorithm first distributes the file of data to be clustered to every task as a global cache file, then splits the file of cluster IDs into blocks, one block per task. Each task iterates over its IDs and computes the distance between each ID and every cluster with a larger ID — for the cluster with ID = 2, for example, the distances to the clusters with IDs 3, 4, 5, … — and writes the results to an intermediate file in the format (distance, ID1, ID2, timestamp). For a distance matrix of O(n²) size, inserting rows into HBase one by one in the reduce phase would be slow, so this implementation uses the Bulk Load method to import the data quickly. Bulk Load relies on the principle that HBase data is stored on HDFS in a specific format. Therefore, this implementation runs a second MapReduce job to convert the intermediate results into HFile-format files, and then uses the Bulk Load method to import the HFiles into HBase. Together, Hadoop and Bulk Load achieve parallel computation of the distance matrix and fast import into HBase.
2. Table design
The design of the tables in HBase has a great impact on the algorithm's performance; an unreasonable design would degrade performance enough to hurt the algorithm's usability. In this implementation the tables are designed around the characteristics of the hierarchical clustering algorithm, as shown in Figure 2. There are two main tables in HBase: distanceMatrix and sortedDistance. The distanceMatrix table stores the distance-matrix data sorted by cluster-ID pair. Its RowKey is the cluster-ID pair, with IDs padded by leading zeros to a uniform RowKey length so that records sort by cluster ID; the value is the inter-cluster distance. This table gives fast access to the distances associated with a given cluster ID. The sortedDistance table stores the distance-matrix data sorted by distance in ascending order. Its RowKey is the inter-cluster distance, padded with leading zeros to keep all RowKeys the same length so that records sort from the smallest distance to the largest; the value is the cluster-ID pair. Reading the first row of this table quickly yields the two closest clusters. The implementation also reserves parameters specifying the initial number of regions for the two tables: in the initial stage of the algorithm, the tables are pre-partitioned by cluster ID and by distance, and the resulting regions are spread across the nodes of the HBase cluster, improving HBase's concurrent processing capability.
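The RowKey scheme can be illustrated as follows (the key widths and separator are illustrative choices, not values from the patent; the point is that zero-padding makes HBase's lexicographic byte order agree with numeric order):

```python
def pair_rowkey(id1, id2, width=6):
    """RowKey for the distanceMatrix table: a zero-padded cluster-ID pair.

    Padding with leading zeros makes lexicographic order agree with
    numeric order, so scans by cluster ID behave as intended.
    """
    a, b = sorted((id1, id2))
    return f"{a:0{width}d}-{b:0{width}d}"

def distance_rowkey(distance, int_width=10, precision=6):
    """RowKey for the sortedDistance table: a zero-padded distance."""
    return f"{distance:0{int_width + 1 + precision}.{precision}f}"
```

With these keys, a scan over distanceMatrix starting at `pair_rowkey(2, 0)` visits rows for cluster 2 in ID order, and the first row of sortedDistance is always the smallest distance.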
3. Simplifying the computation
When computing the distance between two clusters that were themselves formed by merging other clusters, the distances between all point pairs across the two clusters are in principle required. For example, if singleton clusters A and B were merged into cluster D, then computing the distance between singleton cluster C and cluster D uses the distances between C and A and between C and B. In fact, however, it is not necessary to compare the distances between all singleton clusters every time; it suffices to compare the distances between the sub-clusters of the two clusters. Figure 3 shows a hierarchical tree produced by the agglomerative hierarchical clustering algorithm. Define dis(A,B) as the distance between clusters A and B, min as the minimum, max as the maximum, avg as the average, and count(A) as the number of points contained in cluster A. Then, with the single-linkage method, the distance dis(7,9) between cluster 7 and cluster 9 is:
dis(7,9) = min(dis(1,5), dis(1,6), dis(2,5), dis(2,6), dis(3,5), dis(3,6))
         = min(min(dis(1,5), dis(1,6)), min(dis(2,5), dis(2,6)), min(dis(3,5), dis(3,6)))
         = min(dis(1,7), dis(2,7), dis(3,7))
         = min(dis(1,7), min(dis(2,7), dis(3,7)))
         = min(dis(1,7), dis(8,7))
As can be seen, computing the distance dis(7,9) between cluster 7 and cluster 9 requires only dis(1,7) and dis(8,7), not all of the distances between the underlying singleton clusters. The complete-linkage method is analogous to single-linkage: simply replace min with max in the formulas above. With the average-linkage method, the distance dis(7,9) between cluster 7 and cluster 9 is:

dis(7,9) = (count(1)·count(7)·dis(1,7) + count(8)·count(7)·dis(8,7)) / (count(9)·count(7))

As can be seen, average-linkage requires not only dis(1,7) and dis(8,7) but also the number of points contained in each of cluster 7 and cluster 9, so that the average can be computed.
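These identities can be sanity-checked numerically. The point coordinates below are made up for illustration; the cluster structure follows the tree described above (cluster 7 = {5,6}, cluster 8 = {2,3}, cluster 9 = {1} ∪ cluster 8), and both the single-linkage and average-linkage shortcuts are compared against brute force over all point pairs:

```python
import math
import random

random.seed(7)
pts = {i: (random.random(), random.random()) for i in (1, 2, 3, 5, 6)}
d = lambda i, j: math.dist(pts[i], pts[j])

# All point pairs between cluster 9 = {1,2,3} and cluster 7 = {5,6}.
pairs = [(i, j) for i in (1, 2, 3) for j in (5, 6)]
brute_single = min(d(i, j) for i, j in pairs)
brute_avg = sum(d(i, j) for i, j in pairs) / len(pairs)

# Shortcut: only sub-cluster distances (plus counts for average linkage).
dis_1_7 = min(d(1, 5), d(1, 6))
dis_8_7 = min(d(2, 5), d(2, 6), d(3, 5), d(3, 6))
short_single = min(dis_1_7, dis_8_7)

avg_1_7 = (d(1, 5) + d(1, 6)) / 2
avg_8_7 = (d(2, 5) + d(2, 6) + d(3, 5) + d(3, 6)) / 4
# count(1)=1, count(8)=2, count(7)=2, count(9)=3
short_avg = (1 * 2 * avg_1_7 + 2 * 2 * avg_8_7) / (3 * 2)
```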
4. Design and implementation of the cache-assisted multi-threaded algorithm
Besides efficient access to the distance matrix, a parallelized algorithm is needed to carry out the clustering process itself; the design combines multi-threading and caching. Following the principle of hierarchical clustering, the two closest clusters must first be obtained and merged into a new cluster. As introduced above, the sortedDistance table stores cluster pairs and the distances between them and, most importantly, keeps them sorted by distance in ascending order, so only the first record of the table is needed. The first row is fetched with HBase's scan API, with the scan cache set to 1 to avoid retrieving extra data.
For illustration, suppose the initial data set has ten points, each initially its own cluster, giving ten clusters C1 through C10. Suppose that in the first iteration the two closest clusters are C1 and C2, and they are merged into a new cluster C11. As explained in the previous section, the distances between C1, C2 and clusters C3 through C10 are needed to compute the distances between C11 and C3…C10, and the new distances are written to the cache or to HBase. On a cache miss, the distance must be read from HBase. Because this is an I/O-intensive operation, multiple threads concurrently use the Scan API to read the relevant distances from the distanceMatrix table, with scan.setCaching used to tune the number of rows each scan fetches per round trip. While the distances are being read, several threads are also started in parallel to delete, from the sortedDistance table, the distances related to the two merged clusters. Since get and put are likewise I/O-intensive operations, computation on already-fetched data can begin early rather than waiting for all data to arrive: when the scan threads start reading from HBase, the computation thread starts as well instead of waiting for the scans to finish. Multiple threads also write the new distances back to the cache or HBase in parallel, so that by the time all the data has been fetched, the computation is essentially complete.
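One iteration of this loop can be sketched in memory, with plain dicts standing in for the distanceMatrix and sortedDistance tables (`merge_step` is a hypothetical helper; this single-threaded, single-linkage sketch only mirrors the roles of the tables and the delete thread, not the distributed implementation):

```python
def merge_step(dist, counts, next_id):
    """One single-linkage iteration over in-memory stand-ins.

    dist   : {(id1, id2): distance} with id1 < id2   (distanceMatrix role)
    counts : {cluster_id: number of atomic clusters} (kept in memory)
    """
    # sortedDistance role: fetch the closest pair first.
    (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
    for c in [c for c in counts if c not in (a, b)]:
        d_ac = dist[tuple(sorted((a, c)))]
        d_bc = dist[tuple(sorted((b, c)))]
        # Single-linkage update; complete/average linkage are analogous.
        dist[tuple(sorted((next_id, c)))] = min(d_ac, d_bc)
    # Delete-thread role: drop stale rows mentioning the merged clusters.
    for key in [k for k in dist if a in k or b in k]:
        del dist[key]
    counts[next_id] = counts.pop(a) + counts.pop(b)
    return (a, b), d_ab
```

Each call merges the closest pair, writes the new distances, and purges stale rows, just as the parallel threads do against HBase.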
The execution order of the parallel threads is shown in Figure 4. As is well known, some synchronization techniques are needed to control the threads; the algorithm uses BlockingQueue and barrier techniques. A BlockingQueue coordinates communication between two threads that alternately put elements into and take elements out of the queue. Barrier techniques are very useful in parallel iterative algorithms, which split a problem into sub-problems executed in parallel: a thread arriving at the barrier waits until all threads have reached it.
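A minimal Python sketch of this coordination pattern, with `queue.Queue` playing the role of a BlockingQueue and `threading.Barrier` as the barrier (the row data and the "computation" are placeholders, not the patent's actual distance logic):

```python
import queue
import threading

rows = [("c3", 2.5), ("c4", 1.5), ("c5", 3.0)]   # stand-in for scanned rows
results = []
q = queue.Queue(maxsize=2)        # bounded queue, like a BlockingQueue
barrier = threading.Barrier(3)    # scanner + computer + main thread

def scanner():
    for row in rows:              # stands in for the HBase Scan thread
        q.put(row)                # blocks when the queue is full
    q.put(None)                   # end-of-scan sentinel
    barrier.wait()

def computer():
    while True:
        row = q.get()             # blocks until data is available
        if row is None:
            break
        cid, d = row
        results.append((cid, min(d, 2.0)))   # placeholder computation
    barrier.wait()

threading.Thread(target=scanner).start()
threading.Thread(target=computer).start()
barrier.wait()                    # iteration completes only when all arrive
```

The computation thread starts consuming as soon as the first row is available, mirroring how the real algorithm overlaps scanning with computing.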
5. Adjustable parameters
In practice, many Hadoop and HBase parameters must be tuned to the actual application to achieve good performance. In the design of the algorithm, besides the parameters supported by Hadoop and HBase themselves, many algorithm-specific parameters are reserved so they can be adjusted to the situation at hand; the main ones are shown in Figure 5 and described next.
The distance matrix produced during hierarchical clustering is large; it is kept in HBase tables, with a cache layer added on the client. To make full use of the machines in the HBase cluster, the load should be spread across them, so the tables are pre-partitioned at creation time: each table is split into multiple regions and each server takes responsibility for some of them. The regionCountDM and regionCountSD parameters specify how many regions the distanceMatrix and sortedDistance tables are split into, respectively. Meanwhile, a client-side cache layer improves lookup performance, and the cacheSize parameter adjusts the number of records it holds. Corresponding to the single-linkage, complete-linkage, and average-linkage methods of hierarchical clustering, the similarity_method parameter specifies which method to use. The distance_method parameter specifies which distance metric is used between two points; only the Euclidean method was implemented for testing. Because the algorithm's computation is multi-threaded, parameters are also provided to adjust the thread counts: putThreadNum adjusts the number of put threads, and pagesNum controls the number of threads that read data from HBase. The data to be read from HBase is paged, one page per thread, with pagesNum controlling the number of pages: with pagesNum=10 and IDs 0 through 10000 to read, ten threads read 0–1000, 1001–2000, and so on. Finally, to control the termination condition of the clustering, maxClusterNum specifies how many clusters the data should be grouped into, or minDistance specifies that clustering ends when the minimum distance between two clusters is smaller than the value it gives.
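The paging scheme behind pagesNum can be sketched as follows, reproducing the example in the text (pagesNum=10 over IDs 0–10000 yields ranges 0–1000, 1001–2000, and so on, one range per reader thread):

```python
def page_ranges(max_id, pages_num):
    """Split the ID range [0, max_id] into pages_num inclusive scan
    ranges, one per reader thread (mirrors the pagesNum parameter)."""
    step = max_id // pages_num
    ranges = []
    for p in range(pages_num):
        start = 0 if p == 0 else p * step + 1
        end = (p + 1) * step
        ranges.append((start, end))
    return ranges
```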
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610851970.1A CN106484818B (en) | 2016-09-26 | 2016-09-26 | A Hierarchical Clustering Method Based on Hadoop and HBase |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484818A true CN106484818A (en) | 2017-03-08 |
CN106484818B CN106484818B (en) | 2023-04-28 |
Family
ID=58268853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610851970.1A Active CN106484818B (en) | 2016-09-26 | 2016-09-26 | A Hierarchical Clustering Method Based on Hadoop and HBase |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484818B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104965823A (en) * | 2015-07-30 | 2015-10-07 | 成都鼎智汇科技有限公司 | Big data based opinion extraction method |
Non-Patent Citations (1)
Title |
---|
Xu Xiaolong; Li Yongping: "A Knowledge Clustering and Statistics Mechanism Based on MapReduce" * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106932184A (en) * | 2017-03-15 | 2017-07-07 | 国网四川省电力公司广安供电公司 | A kind of Diagnosis Method of Transformer Faults based on improvement hierarchical clustering |
CN112668622A (en) * | 2020-12-22 | 2021-04-16 | 中国矿业大学(北京) | Analysis method and analysis and calculation device for coal geological composition data |
CN113268333A (en) * | 2021-06-21 | 2021-08-17 | 成都深思科技有限公司 | Hierarchical clustering algorithm optimization method based on multi-core calculation |
CN113268333B (en) * | 2021-06-21 | 2024-03-19 | 成都锋卫科技有限公司 | Hierarchical clustering algorithm optimization method based on multi-core computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||