CN106484818A - Hierarchical clustering method based on Hadoop and HBase - Google Patents
- Publication number
- CN106484818A CN106484818A CN201610851970.1A CN201610851970A CN106484818A CN 106484818 A CN106484818 A CN 106484818A CN 201610851970 A CN201610851970 A CN 201610851970A CN 106484818 A CN106484818 A CN 106484818A
- Authority
- CN
- China
- Prior art keywords
- hbase
- distance
- cluster
- clusters
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a hierarchical clustering method based on Hadoop and HBase. The method uses Hadoop to compute the distance matrix, converts the result into HFile files, and imports them into HBase with the Bulk Load method. HBase stores the distance matrix, mainly in two tables: one sorted by cluster-ID pair and the other sorted by inter-cluster distance, so that in each iteration the two closest clusters can be conveniently retrieved and merged. Finally, a multi-threaded algorithm combined with a cache processes the distance matrix in HBase, implementing hierarchical clustering with several tunable parameters; the algorithm simultaneously supports the single-linkage, complete-linkage, and average-linkage methods. The proposed scheme exploits Hadoop's parallel computing capability and HBase's massive-data storage capability, thereby improving the big-data processing capability and scalability of the hierarchical clustering algorithm.
Description
Technical Field
The invention relates to the technical fields of hierarchical clustering algorithms, Hadoop, and HBase, and in particular to the design and implementation of a hierarchical clustering method based on Hadoop and HBase.
Background Art
As a simple and widely accepted clustering algorithm, hierarchical clustering has been applied in many fields, such as information retrieval and bioinformatics. Its advantage is that it presents the clustering result in considerable detail: it organizes the relationships between clusters into a dendrogram, so users can see exactly how the clusters were merged together, which many other clustering algorithms do not provide. Moreover, unlike k-means and similar algorithms, hierarchical clustering does not require the user to specify the number of clusters in advance. Although hierarchical clustering has many advantages and is widely accepted and used, the performance of single-machine implementations can no longer keep up with rapidly growing data volumes; the algorithm's high complexity and inherent data dependencies make it hard to execute efficiently on large data sets. Yet more useful information can usually be extracted only from more complete data, and data-set size has become a very important factor in machine learning. This creates an urgent need for a hierarchical clustering algorithm that can run on large data sets.
Hadoop is a software framework for the distributed processing of massive data that provides a reliable, efficient, and scalable way to process it. HBase, another important member of the Hadoop ecosystem, is a non-relational distributed database. It provides a highly reliable, high-performance, scalable, column-oriented storage system suitable for storing unstructured data. As very important big-data technologies, both Hadoop and HBase are widely used across many big-data fields.
Hierarchical clustering depends on a distance matrix with O(n²) space complexity, so it can neither handle large data sets well on a single machine nor scale well. Exploiting Hadoop's parallel computing capability and HBase's high-performance massive-data storage can provide an effective solution. However, no hierarchical clustering algorithm on the Hadoop and HBase platforms currently supports the single-linkage, complete-linkage, and average-linkage methods at the same time.
Summary of the Invention
The object of the present invention is to overcome the above shortcomings of the prior art by providing a hierarchical clustering method based on Hadoop and HBase that simultaneously supports the single-linkage, complete-linkage, and average-linkage methods. The specific technical scheme is as follows.
A hierarchical clustering method based on Hadoop and HBase, which uses Hadoop to parallelize the computation of the distance matrix and uses HBase to store it. The table design uses a RowKey scheme that takes full advantage of HBase's sorted storage, and multi-threading and caching are combined to process the distance matrix in HBase, yielding a scalable hierarchical clustering method applicable to big data.
Further, the parallelized computation of the distance matrix is implemented with Hadoop and consists mainly of two MapReduce jobs: the first computes the distances and stores the results in an intermediate file, and the second converts the intermediate results into HFile format. Finally, the Bulk Load method imports the results into HBase.
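The distance computation in the first job can be sketched in plain Python standing in for the Hadoop mapper (a simplified, single-process sketch: the record layout mirrors the (distance, ID1, ID2) intermediate format described later, and the Euclidean metric is assumed):

```python
import math

def euclidean(p, q):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def map_distances(points):
    """Emit one (distance, id1, id2) record per unordered pair.

    `points` maps cluster ID -> feature vector.  Mirroring the mapper
    described in the text, each ID is paired only with strictly larger
    IDs, so every pair is computed exactly once.
    """
    records = []
    for i in sorted(points):
        for j in sorted(points):
            if j > i:
                records.append((euclidean(points[i], points[j]), i, j))
    return records
```

In the real pipeline these records would go to an intermediate file and then be converted to HFiles; here they are simply returned as a list.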
Further, there are two main tables in HBase. In one table, the RowKey is a cluster-ID pair, with the cluster IDs padded with leading zeros to a uniform length so that the records in the table sort by cluster ID; the value is the distance between the clusters. This table gives fast access to the distances associated with a given cluster ID. In the other table, the RowKey is the inter-cluster distance, again padded with leading zeros so that the records sort by distance in ascending order; the value is the cluster-ID pair. Reading the first row of this table quickly yields the two closest clusters.
In the initial stage of the clustering method, the two tables are pre-partitioned in advance by cluster ID and by distance respectively, and the partitions are spread across the nodes of the HBase cluster.
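The pre-partitioning step can be illustrated by computing evenly spaced, zero-padded boundary keys for a table's ID space (a sketch only; the key width and region count below are illustrative values, not figures from the patent):

```python
def split_keys(max_id, num_regions, width=6):
    """Evenly spaced, zero-padded boundary keys for pre-splitting.

    Returns num_regions - 1 boundary row keys covering IDs 0..max_id,
    analogous to pre-partitioning the cluster-ID table so that each
    region server takes a share of the key space.
    """
    step = max_id // num_regions
    return [str(i * step).zfill(width) for i in range(1, num_regions)]
```

Because the keys are zero-padded, their lexicographic order matches numeric order, which is what lets HBase route each key range to its region.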
Further, during clustering, the merge path of each cluster and the number of atomic clusters it contains are recorded in memory. With the single-linkage or complete-linkage algorithm, computing the distance between a newly merged cluster and an existing cluster only requires retrieving, from HBase or the cache, the distances between the two clusters that formed the new cluster and the remaining clusters. With the average-linkage algorithm, in addition to those distances, the number of atomic clusters contained in each cluster must be obtained from memory so that the average can be computed.
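The update rules described above can be sketched as a small helper (`updated_distance` is a hypothetical name; the formulas are the standard single-, complete-, and average-linkage update rules for merging clusters A and B and measuring against another cluster C):

```python
def updated_distance(method, d_ac, d_bc, n_a=1, n_b=1):
    """Distance between the merged cluster (A ∪ B) and another cluster C.

    Only d(A,C), d(B,C) and, for average linkage, the atomic-cluster
    counts n_a, n_b are needed -- exactly the information the method
    keeps in HBase/cache and in memory.
    """
    if method == "single":
        return min(d_ac, d_bc)
    if method == "complete":
        return max(d_ac, d_bc)
    if method == "average":
        # Weighted mean over all point pairs between (A ∪ B) and C.
        return (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
    raise ValueError(f"unknown linkage method: {method}")
```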
Further, the cache-assisted multi-threaded algorithm obtains the two closest clusters from HBase and merges them into a new cluster, fetching in parallel, from the cache or HBase, the distances between those two clusters and all other clusters. Meanwhile, a deletion thread is started to remove stale data from HBase, a computation thread is started to compute, ahead of time, the distances between the new cluster and the clusters whose distance information has already been fetched, and the new distances are written back to the cache or HBase in parallel. The cache is used here to reduce the algorithm's network I/O.
Further, Hadoop is used to compute the algorithm's distance matrix and HBase to store it, and the algorithm is designed and implemented with multi-threading and caching, thereby improving the scalability of hierarchical clustering and its capacity to process big data.
Compared with the prior art, the present invention has the following advantages and technical effects:
The invention implements a parallelized distance-matrix computation algorithm on Hadoop, converts the results into HFile files, and imports them into HBase with the Bulk Load method. HBase stores the distance matrix; the tables use a distinctive RowKey design that exploits HBase's sorted storage, allowing the algorithm to quickly obtain distance information and the two closest clusters. The algorithm combines multi-threading and caching and reserves several tunable parameters. Through Hadoop's parallel computing capability and HBase's massive-data storage capability, the scalability and big-data processing capability of hierarchical clustering are improved.
Brief Description of the Drawings
Figure 1 is an architecture diagram of the distance-matrix computation algorithm in the example.
Figure 2 is a schematic diagram of the table design in the example.
Figure 3 is a schematic diagram of a hierarchical tree produced by the hierarchical clustering algorithm in the example.
Figure 4 is a schematic diagram of the thread execution sequence.
Figure 5 is an explanatory table of the algorithm parameters in the example.
Detailed Description
To make the technical scheme and advantages of the invention clearer, a further detailed description follows in conjunction with the drawings, but the implementation and protection of the invention are not limited to it. It should be noted that any symbol or process not specifically described in detail below can be understood or implemented by those skilled in the art with reference to the prior art.
1. Parallelized computation of the distance matrix
The parallelized distance-matrix computation algorithm aims to speed up computing the distance matrix and to import it quickly into HBase. During clustering, the hierarchical algorithm depends on a distance matrix with O(n²) space complexity; in this method, a Hadoop-based parallel algorithm is designed and implemented for this computation, as shown in Figure 1. The algorithm first distributes the file of data to be clustered to every task as a global cache file, then splits the file of cluster IDs into blocks, one block per task. Each task iterates over its IDs and computes the distance between each ID and every cluster with a larger ID — for the cluster with ID = 2, for example, the distances to the clusters with IDs 3, 4, 5, … — and writes the results to an intermediate file in the format (distance, ID1, ID2, timestamp). For a distance matrix of O(n²) size, inserting rows into HBase one by one in the reduce phase would be slow, so this implementation uses the Bulk Load method to import the data quickly. Bulk Load relies on the principle that HBase data is stored on HDFS in a specific format. Therefore, this implementation runs a second MapReduce job to convert the intermediate results into HFile-format files, and then uses the Bulk Load method to import the HFiles into HBase. Together, Hadoop and Bulk Load achieve parallel computation of the distance matrix and fast import into HBase.
2. Table design
The design of the tables in HBase has a great impact on the algorithm's performance; an unreasonable design would degrade performance enough to hurt the algorithm's usability. In this implementation the tables are designed around the characteristics of the hierarchical clustering algorithm, as shown in Figure 2. There are two main tables in HBase: distanceMatrix and sortedDistance. The distanceMatrix table stores the distance-matrix data sorted by cluster-ID pair. Its RowKey is the cluster-ID pair, with IDs padded by leading zeros to a uniform RowKey length so that records sort by cluster ID; the value is the inter-cluster distance. This table gives fast access to the distances associated with a given cluster ID. The sortedDistance table stores the distance-matrix data sorted by distance in ascending order. Its RowKey is the inter-cluster distance, padded with leading zeros to keep all RowKeys the same length so that records sort from the smallest distance to the largest; the value is the cluster-ID pair. Reading the first row of this table quickly yields the two closest clusters. The implementation also reserves parameters specifying the initial number of regions for the two tables: in the initial stage of the algorithm, the tables are pre-partitioned by cluster ID and by distance, and the resulting regions are spread across the nodes of the HBase cluster, improving HBase's concurrent processing capability.
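The RowKey scheme can be illustrated as follows (the key widths and separator are illustrative choices, not values from the patent; the point is that zero-padding makes HBase's lexicographic byte order agree with numeric order):

```python
def pair_rowkey(id1, id2, width=6):
    """RowKey for the distanceMatrix table: a zero-padded cluster-ID pair.

    Padding with leading zeros makes lexicographic order agree with
    numeric order, so scans by cluster ID behave as intended.
    """
    a, b = sorted((id1, id2))
    return f"{a:0{width}d}-{b:0{width}d}"

def distance_rowkey(distance, int_width=10, precision=6):
    """RowKey for the sortedDistance table: a zero-padded distance."""
    return f"{distance:0{int_width + 1 + precision}.{precision}f}"
```

With these keys, a scan over distanceMatrix starting at `pair_rowkey(2, 0)` visits rows for cluster 2 in ID order, and the first row of sortedDistance is always the smallest distance.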
3. Simplifying the computation
When computing the distance between two clusters that were themselves formed by merging other clusters, the distances between all point pairs across the two clusters are in principle required. For example, if singleton clusters A and B were merged into cluster D, then computing the distance between singleton cluster C and cluster D uses the distances between C and A and between C and B. In fact, however, it is not necessary to compare the distances between all singleton clusters every time; it suffices to compare the distances between the sub-clusters of the two clusters. Figure 3 shows a hierarchical tree produced by the agglomerative hierarchical clustering algorithm. Define dis(A,B) as the distance between clusters A and B, min as the minimum, max as the maximum, avg as the average, and count(A) as the number of points contained in cluster A. Then, with the single-linkage method, the distance dis(7,9) between cluster 7 and cluster 9 is:
dis(7,9) = min(dis(1,5), dis(1,6), dis(2,5), dis(2,6), dis(3,5), dis(3,6))
         = min(min(dis(1,5), dis(1,6)), min(dis(2,5), dis(2,6)), min(dis(3,5), dis(3,6)))
         = min(dis(1,7), dis(2,7), dis(3,7))
         = min(dis(1,7), min(dis(2,7), dis(3,7)))
         = min(dis(1,7), dis(8,7))
As can be seen, computing the distance dis(7,9) between cluster 7 and cluster 9 requires only dis(1,7) and dis(8,7), not all of the distances between the underlying singleton clusters. The complete-linkage method is analogous to single-linkage: simply replace min with max in the formulas above. With the average-linkage method, the distance dis(7,9) between cluster 7 and cluster 9 is:

dis(7,9) = (count(1)·count(7)·dis(1,7) + count(8)·count(7)·dis(8,7)) / (count(9)·count(7))

As can be seen, average-linkage requires not only dis(1,7) and dis(8,7) but also the number of points contained in each of cluster 7 and cluster 9, so that the average can be computed.
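These identities can be sanity-checked numerically. The point coordinates below are made up for illustration; the cluster structure follows the tree described above (cluster 7 = {5,6}, cluster 8 = {2,3}, cluster 9 = {1} ∪ cluster 8), and both the single-linkage and average-linkage shortcuts are compared against brute force over all point pairs:

```python
import math
import random

random.seed(7)
pts = {i: (random.random(), random.random()) for i in (1, 2, 3, 5, 6)}
d = lambda i, j: math.dist(pts[i], pts[j])

# All point pairs between cluster 9 = {1,2,3} and cluster 7 = {5,6}.
pairs = [(i, j) for i in (1, 2, 3) for j in (5, 6)]
brute_single = min(d(i, j) for i, j in pairs)
brute_avg = sum(d(i, j) for i, j in pairs) / len(pairs)

# Shortcut: only sub-cluster distances (plus counts for average linkage).
dis_1_7 = min(d(1, 5), d(1, 6))
dis_8_7 = min(d(2, 5), d(2, 6), d(3, 5), d(3, 6))
short_single = min(dis_1_7, dis_8_7)

avg_1_7 = (d(1, 5) + d(1, 6)) / 2
avg_8_7 = (d(2, 5) + d(2, 6) + d(3, 5) + d(3, 6)) / 4
# count(1)=1, count(8)=2, count(7)=2, count(9)=3
short_avg = (1 * 2 * avg_1_7 + 2 * 2 * avg_8_7) / (3 * 2)
```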
4. Design and implementation of the cache-assisted multi-threaded algorithm
Besides efficient access to the distance matrix, a parallelized algorithm is needed to carry out the clustering process itself; the design combines multi-threading and caching. Following the principle of hierarchical clustering, the two closest clusters must first be obtained and merged into a new cluster. As introduced above, the sortedDistance table stores cluster pairs and the distances between them and, most importantly, keeps them sorted by distance in ascending order, so only the first record of the table is needed. The first row is fetched with HBase's scan API, with the scan cache set to 1 to avoid retrieving extra data.
For illustration, suppose the initial data set has ten points, each initially its own cluster, giving ten clusters C1 through C10. Suppose that in the first iteration the two closest clusters are C1 and C2, and they are merged into a new cluster C11. As explained in the previous section, the distances between C1, C2 and clusters C3 through C10 are needed to compute the distances between C11 and C3…C10, and the new distances are written to the cache or to HBase. On a cache miss, the distance must be read from HBase. Because this is an I/O-intensive operation, multiple threads concurrently use the Scan API to read the relevant distances from the distanceMatrix table, with scan.setCaching used to tune the number of rows each scan fetches per round trip. While the distances are being read, several threads are also started in parallel to delete, from the sortedDistance table, the distances related to the two merged clusters. Since get and put are likewise I/O-intensive operations, computation on already-fetched data can begin early rather than waiting for all data to arrive: when the scan threads start reading from HBase, the computation thread starts as well instead of waiting for the scans to finish. Multiple threads also write the new distances back to the cache or HBase in parallel, so that by the time all the data has been fetched, the computation is essentially complete.
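One iteration of this loop can be sketched in memory, with plain dicts standing in for the distanceMatrix and sortedDistance tables (`merge_step` is a hypothetical helper; this single-threaded, single-linkage sketch only mirrors the roles of the tables and the delete thread, not the distributed implementation):

```python
def merge_step(dist, counts, next_id):
    """One single-linkage iteration over in-memory stand-ins.

    dist   : {(id1, id2): distance} with id1 < id2   (distanceMatrix role)
    counts : {cluster_id: number of atomic clusters} (kept in memory)
    """
    # sortedDistance role: fetch the closest pair first.
    (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
    for c in [c for c in counts if c not in (a, b)]:
        d_ac = dist[tuple(sorted((a, c)))]
        d_bc = dist[tuple(sorted((b, c)))]
        # Single-linkage update; complete/average linkage are analogous.
        dist[tuple(sorted((next_id, c)))] = min(d_ac, d_bc)
    # Delete-thread role: drop stale rows mentioning the merged clusters.
    for key in [k for k in dist if a in k or b in k]:
        del dist[key]
    counts[next_id] = counts.pop(a) + counts.pop(b)
    return (a, b), d_ab
```

Each call merges the closest pair, writes the new distances, and purges stale rows, just as the parallel threads do against HBase.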
The execution order of the parallel threads is shown in Figure 4. As is well known, some synchronization techniques are needed to control the threads; the algorithm uses BlockingQueue and barrier techniques. A BlockingQueue coordinates communication between two threads that alternately put elements into and take elements out of the queue. Barrier techniques are very useful in parallel iterative algorithms, which split a problem into sub-problems executed in parallel: a thread arriving at the barrier waits until all threads have reached it.
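A minimal Python sketch of this coordination pattern, with `queue.Queue` playing the role of a BlockingQueue and `threading.Barrier` as the barrier (the row data and the "computation" are placeholders, not the patent's actual distance logic):

```python
import queue
import threading

rows = [("c3", 2.5), ("c4", 1.5), ("c5", 3.0)]   # stand-in for scanned rows
results = []
q = queue.Queue(maxsize=2)        # bounded queue, like a BlockingQueue
barrier = threading.Barrier(3)    # scanner + computer + main thread

def scanner():
    for row in rows:              # stands in for the HBase Scan thread
        q.put(row)                # blocks when the queue is full
    q.put(None)                   # end-of-scan sentinel
    barrier.wait()

def computer():
    while True:
        row = q.get()             # blocks until data is available
        if row is None:
            break
        cid, d = row
        results.append((cid, min(d, 2.0)))   # placeholder computation
    barrier.wait()

threading.Thread(target=scanner).start()
threading.Thread(target=computer).start()
barrier.wait()                    # iteration completes only when all arrive
```

The computation thread starts consuming as soon as the first row is available, mirroring how the real algorithm overlaps scanning with computing.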
5. Adjustable parameters
In practice, many Hadoop and HBase parameters must be tuned to the actual application to achieve good performance. In the design of the algorithm, besides the parameters supported by Hadoop and HBase themselves, many algorithm-specific parameters are reserved so they can be adjusted to the situation at hand; the main ones are shown in Figure 5 and described next.
The distance matrix produced during hierarchical clustering is large; it is kept in HBase tables, with a cache layer added on the client. To make full use of the machines in the HBase cluster, the load should be spread across them, so the tables are pre-partitioned at creation time: each table is split into multiple regions and each server takes responsibility for some of them. The regionCountDM and regionCountSD parameters specify how many regions the distanceMatrix and sortedDistance tables are split into, respectively. Meanwhile, a client-side cache layer improves lookup performance, and the cacheSize parameter adjusts the number of records it holds. Corresponding to the single-linkage, complete-linkage, and average-linkage methods of hierarchical clustering, the similarity_method parameter specifies which method to use. The distance_method parameter specifies which distance metric is used between two points; only the Euclidean method was implemented for testing. Because the algorithm's computation is multi-threaded, parameters are also provided to adjust the thread counts: putThreadNum adjusts the number of put threads, and pagesNum controls the number of threads that read data from HBase. The data to be read from HBase is paged, one page per thread, with pagesNum controlling the number of pages: with pagesNum=10 and IDs 0 through 10000 to read, ten threads read 0–1000, 1001–2000, and so on. Finally, to control the termination condition of the clustering, maxClusterNum specifies how many clusters the data should be grouped into, or minDistance specifies that clustering ends when the minimum distance between two clusters is smaller than the value it gives.
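The paging scheme behind pagesNum can be sketched as follows, reproducing the example in the text (pagesNum=10 over IDs 0–10000 yields ranges 0–1000, 1001–2000, and so on, one range per reader thread):

```python
def page_ranges(max_id, pages_num):
    """Split the ID range [0, max_id] into pages_num inclusive scan
    ranges, one per reader thread (mirrors the pagesNum parameter)."""
    step = max_id // pages_num
    ranges = []
    for p in range(pages_num):
        start = 0 if p == 0 else p * step + 1
        end = (p + 1) * step
        ranges.append((start, end))
    return ranges
```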
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610851970.1A CN106484818B (en) | 2016-09-26 | 2016-09-26 | A Hierarchical Clustering Method Based on Hadoop and HBase |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484818A true CN106484818A (en) | 2017-03-08 |
CN106484818B CN106484818B (en) | 2023-04-28 |
Family
ID=58268853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610851970.1A Active CN106484818B (en) | 2016-09-26 | 2016-09-26 | A Hierarchical Clustering Method Based on Hadoop and HBase |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484818B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104965823A (en) * | 2015-07-30 | 2015-10-07 | 成都鼎智汇科技有限公司 | Big data based opinion extraction method |
Non-Patent Citations (1)
Title |
---|
Xu Xiaolong; Li Yongping: "A Knowledge Clustering and Statistics Mechanism Based on MapReduce" * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106932184A (en) * | 2017-03-15 | 2017-07-07 | 国网四川省电力公司广安供电公司 | A kind of Diagnosis Method of Transformer Faults based on improvement hierarchical clustering |
CN112668622A (en) * | 2020-12-22 | 2021-04-16 | 中国矿业大学(北京) | Analysis method and analysis and calculation device for coal geological composition data |
CN113268333A (en) * | 2021-06-21 | 2021-08-17 | 成都深思科技有限公司 | Hierarchical clustering algorithm optimization method based on multi-core calculation |
CN113268333B (en) * | 2021-06-21 | 2024-03-19 | 成都锋卫科技有限公司 | Hierarchical clustering algorithm optimization method based on multi-core computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||