CN106484818A - A kind of hierarchy clustering method based on Hadoop and HBase - Google Patents
- Publication number
- CN106484818A CN106484818A CN201610851970.1A CN201610851970A CN106484818A CN 106484818 A CN106484818 A CN 106484818A CN 201610851970 A CN201610851970 A CN 201610851970A CN 106484818 A CN106484818 A CN 106484818A
- Authority
- CN
- China
- Prior art keywords
- cluster
- hbase
- distance
- hadoop
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a hierarchical clustering method based on Hadoop and HBase. The method computes the distance matrix with Hadoop, converts the result into HFile files, and imports them into HBase with the Bulk Load method. HBase stores the distance matrix, mainly in two tables: one sorted by cluster-ID pair and one sorted by the distance between clusters, so that the two closest clusters can be conveniently retrieved and merged in each iteration. Finally, a multithreaded algorithm combined with a caching technique processes the distance matrix in HBase, implements hierarchical clustering, and reserves several tunable parameters; the algorithm simultaneously supports the single-linkage, complete-linkage, and average-linkage clustering methods. The proposed scheme exploits the parallel computing capability of Hadoop and the mass-data storage capability of HBase, thereby improving the big-data processing capability and scalability of the hierarchical clustering algorithm.
Description
Technical field
The present invention relates to the technical fields of hierarchical clustering algorithms, Hadoop, and HBase, and more particularly to the design and implementation of a hierarchical clustering method based on Hadoop and HBase.
Background technology
As a simple and widely accepted clustering algorithm, hierarchical clustering is applied in many areas, such as information retrieval and bioinformatics. Its advantage is that it presents the clustering result in greater detail: it organizes the relationships between clusters into a dendrogram, so the user can see clearly how each cluster was formed, which many other clustering algorithms do not provide. Moreover, compared with algorithms such as k-means, hierarchical clustering does not require the user to specify the number of clusters in advance. Although hierarchical clustering has many advantages and is widely used, with the rapid growth of data volume the performance of a single-machine implementation can no longer meet demand, and the high complexity and intrinsic data dependencies of the algorithm make it difficult to execute efficiently on large data sets. Yet more useful information can often be extracted only from sufficiently large data, and data set size has become a very important factor in machine learning. This creates an urgent need for a hierarchical clustering algorithm that can run on large data sets.
Hadoop is a software framework for distributed processing of massive data, offering a reliable, efficient, and scalable way to process data. HBase, another important member of the Hadoop ecosystem, is a non-relational distributed database. It provides a highly reliable, high-performance, scalable, column-oriented storage system suitable for unstructured data. As essential big-data processing technologies, Hadoop and HBase are widely used in many big-data fields.
Hierarchical clustering depends on a distance matrix with O(n²) space complexity, so a single machine can neither handle large data sets well nor scale. The parallel computing capability of Hadoop and the high-performance mass-data storage capability of HBase can provide an effective solution for hierarchical clustering. However, no hierarchical clustering algorithm on the Hadoop and HBase platform currently supports the single-linkage, complete-linkage, and average-linkage clustering methods at the same time.
Content of the invention
The object of the invention is to overcome the above deficiencies of the prior art and provide a hierarchical clustering method based on Hadoop and HBase that simultaneously supports the single-linkage, complete-linkage, and average-linkage clustering methods. The concrete technical scheme is as follows.
A hierarchical clustering method based on Hadoop and HBase: the distance matrix computation is parallelized with Hadoop, and the distance matrix is stored in HBase; the table design uses RowKey design to make full use of HBase's sorting capability; and the distance matrix in HBase is processed with multithreading and a caching technique, thereby realizing a scalable hierarchical clustering method applicable to big data.
Further, the parallelized computation of the distance matrix is realized with Hadoop and largely divided into two MapReduce algorithms: the first MapReduce algorithm computes the distances and stores the result in an intermediate file; the second converts the intermediate file into HFile format. Finally the result is imported into HBase with the Bulk Load method.
Further, HBase mainly holds two tables. The RowKey of one table is the Cluster ID pair, with the Cluster IDs padded with leading zeros to a uniform length, so the records in the table are sorted by Cluster ID; the value is the distance between the Clusters. Through this table, the distances related to a specified Cluster ID can be obtained rapidly. The RowKey of the other table is the distance between Clusters, padded with leading zeros, so the records in the table are sorted by distance from smallest to largest; the value is the Cluster ID pair. By fetching the first row of this table, the two closest Clusters can be obtained rapidly.
In the starting stage of the clustering method, the two tables are pre-partitioned by Cluster ID and by distance in advance, and the partitions are spread across the nodes of the HBase cluster.
Further, during clustering, the merge path of each Cluster and the number of atomic Clusters each Cluster contains are recorded in memory. With the single-linkage or complete-linkage algorithm, computing the distance between a newly merged Cluster and an existing Cluster only requires fetching, from HBase or the cache, the distances between the two Clusters that formed the new Cluster and the remaining Clusters. With the average-linkage algorithm, besides those distances, the number of atomic Clusters contained in each Cluster must also be obtained from memory to compute the mean value.
Further, the multithreaded algorithm combined with the cache obtains the two closest Clusters from HBase and merges them into a new Cluster; it concurrently fetches, from the cache or HBase, the distance information between the two closest Clusters and the other Clusters. During this time a deletion thread removes stale data from HBase, while computation threads start early, computing the distance between each Cluster whose distance information has been obtained and the new Cluster, and the new distance information is written back to the cache or HBase in parallel. The caching technique reduces the network I/O of the algorithm.
Further, the distance matrix is computed with Hadoop and stored in HBase, and the algorithm is designed and implemented with multithreading and caching techniques, thereby improving the scalability and big-data processing capability of the hierarchical clustering algorithm.
Compared with the prior art, the invention has the following advantages and technical effects:
The invention implements a parallelized distance matrix computation algorithm based on Hadoop, converts the result into HFile files, and imports them into HBase with the Bulk Load method. HBase stores the distance matrix; a distinctive RowKey design makes full use of HBase's sorting capability, letting the algorithm quickly obtain distance information and the two closest Clusters. Combined with multithreading and caching, the algorithm is implemented with several tunable parameters reserved. Through the parallel computing capability of Hadoop and the mass-data storage capability of HBase, the scalability and big-data processing capability of hierarchical clustering are improved.
Description of the drawings
Fig. 1 is the architecture diagram of the distance matrix computation algorithm in the example.
Fig. 2 is a schematic diagram of the table design in the example.
Fig. 3 is a schematic diagram of the hierarchical tree of the hierarchical clustering algorithm in the example.
Fig. 4 is a schematic diagram of the thread execution order.
Fig. 5 is an explanatory diagram of the algorithm parameters in the example.
Specific embodiment
To make the technical scheme and advantages clearer, a further detailed description is given below with reference to the accompanying drawings, but the implementation and protection of the present invention are not limited thereto. Note that any symbol or process not described in detail below can be understood or realized by those skilled in the art with reference to the prior art.
1. the parallelization computational algorithm of distance matrix
The parallelized distance matrix computation algorithm aims to speed up the computation of the distance matrix and to import it into HBase quickly. Hierarchical clustering relies on a distance matrix with O(n²) space complexity during clustering, so in this method a parallelized computation algorithm based on Hadoop is designed and implemented for the distance matrix, as shown in Fig. 1. The algorithm first distributes the file containing the data to be clustered to every task as a global cache file, then splits the file of Cluster IDs into blocks, each task processing one block. Each task iterates over its IDs and computes the distances between each ID and all Clusters with larger ID values; for example, for the Cluster with ID=2 it computes the distances to the Clusters with ID=3, 4, 5, ..., and writes the results to an intermediate file in the form (distance, ID1, ID2, timestamp). For a distance matrix of O(n²) space complexity, inserting the data into HBase item by item in the reducer would be slow, so this implementation uses the Bulk Load method to import the data into HBase quickly. The Bulk Load method relies on the fact that HBase data is stored on HDFS in a specific format. Accordingly, another MapReduce program converts the intermediate result into files in HFile format, and the Bulk Load method then imports the HFiles into HBase. Using Hadoop and the Bulk Load method, the distance matrix is computed in parallel and imported into HBase quickly.
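As an illustration only (not the patent's actual implementation, which runs as Hadoop MapReduce jobs), the map phase described above can be sketched in Python; the function names and the tiny data set are hypothetical:

```python
import math

def euclidean(p, q):
    """Distance measure between two points; the patent's tests use only Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def map_distances(id_block, points):
    """One map task: for each ID in its block, emit the distance to every
    cluster with a larger ID, as (distance, (id1, id2)) records, mirroring
    the (distance, ID1, ID2, timestamp) intermediate-file records."""
    records = []
    for i in id_block:
        for j in range(i + 1, len(points)):
            records.append((euclidean(points[i], points[j]), (i, j)))
    return records

# Simulate two map tasks over a toy data set of 4 points.
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 2.0), (1.0, 2.0)]
out = map_distances([0, 1], points) + map_distances([2, 3], points)
assert len(out) == 6          # n*(n-1)/2 pairs for n = 4
assert min(out)[1] == (2, 3)  # the closest pair of clusters
```

In a real deployment each `map_distances` call corresponds to one Hadoop task over an ID block, and a second MapReduce job would rewrite the emitted records as HFiles for Bulk Load.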
2. the design of table
The design of the tables in HBase greatly affects the performance of the algorithm; an unreasonable design would degrade performance substantially and thus hurt the usability of the algorithm. In this implementation the tables are designed around the characteristics of hierarchical clustering, as shown in Fig. 2. There are mainly two tables in HBase: the distanceMatrix table and the sortedDistance table. The distanceMatrix table stores the distance matrix data sorted by Cluster ID pair from smallest to largest; its RowKey is the Cluster ID pair, with the Cluster IDs padded with leading zeros so that all RowKeys have the same length, so the records are sorted by Cluster ID, and the value is the distance between the Clusters. Through this table the distances related to a specified Cluster ID can be obtained rapidly. The sortedDistance table stores the distance matrix data sorted by distance from smallest to largest; its RowKey is the distance between Clusters, padded with leading zeros so that all RowKeys have the same length, so the records are sorted by distance, and the value is the Cluster ID pair. By fetching the first row of this table, the two closest Clusters can be obtained rapidly. The implementation also reserves parameters to specify the initial number of regions of the two tables: in the starting stage of the algorithm, the two tables are pre-partitioned by Cluster ID and by distance, and the regions are spread across the nodes of the HBase cluster, improving HBase's concurrent processing capability.
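A minimal sketch of the RowKey scheme, assuming fixed pad widths and a hyphen separator (details the patent leaves open); appending the ID pair to the distance key is an assumed disambiguation for equal distances, since HBase RowKeys must be unique:

```python
ID_WIDTH = 6     # assumed fixed widths; the patent only requires uniform length
DIST_WIDTH = 12

def id_pair_rowkey(id1, id2):
    """RowKey of the distanceMatrix table: the Cluster ID pair, each ID
    left-padded with zeros so lexicographic order equals numeric order."""
    lo, hi = sorted((id1, id2))
    return f"{lo:0{ID_WIDTH}d}-{hi:0{ID_WIDTH}d}"

def distance_rowkey(distance, id1, id2):
    """RowKey of the sortedDistance table: the distance, zero-padded before
    the decimal point so rows sort from smallest to largest distance; the
    ID pair is appended here as an assumed tie-breaker."""
    return f"{distance:0{DIST_WIDTH}.4f}-{id_pair_rowkey(id1, id2)}"

keys = sorted(distance_rowkey(d, a, b)
              for d, a, b in [(12.5, 3, 1), (2.25, 4, 2), (102.0, 5, 6)])
assert id_pair_rowkey(5, 3) == "000003-000005"
assert keys[0].startswith("0000002.2500")   # first row = closest pair
```

Because HBase stores rows in lexicographic RowKey order, this padding is exactly what lets a single scan of the first row of sortedDistance return the two closest clusters.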
3. simplify and calculate
When computing the distance between two clusters that were themselves merged from other clusters, the distances between all point pairs of the two clusters would in principle be needed. For example, if singleton cluster A and singleton cluster B were merged into cluster D, then computing the distance between singleton cluster C and cluster D requires the distance between C and A and the distance between C and B. In fact, however, it is not necessary to compare the distances of all singleton clusters each time; it suffices to compare the distances between the immediate sub-clusters of the two clusters. Fig. 3 shows a hierarchical tree of the agglomerative hierarchical clustering algorithm. Define dis(A, B) as the distance between cluster A and cluster B, min as taking the minimum, max as taking the maximum, avg as taking the average, and count(A) as the number of points cluster A contains. With the single-linkage method, the distance dis(7, 9) between cluster 7 and cluster 9 is:
dis(7,9) = min(dis(1,5), dis(1,6), dis(2,5), dis(2,6), dis(3,5), dis(3,6))
= min(min(dis(1,5), dis(1,6)), min(dis(2,5), dis(2,6)), min(dis(3,5), dis(3,6)))
= min(dis(1,7), dis(2,7), dis(3,7))
= min(dis(1,7), min(dis(2,7), dis(3,7)))
= min(dis(1,7), dis(8,7))
It can be seen that computing the distance dis(7, 9) between cluster 7 and cluster 9 only needs dis(1,7) and dis(8,7), not all the pairwise distances between their constituent singleton clusters. The complete-linkage method is similar to single-linkage; only the min in the formula above is changed to max. With the average-linkage method, the distance dis(7, 9) between cluster 7 and cluster 9 is:

dis(7,9) = (count(1)·dis(1,7) + count(8)·dis(8,7)) / (count(1) + count(8))

It can be seen that average-linkage not only needs dis(1,7) and dis(8,7), but also the number of points contained in each sub-cluster of cluster 7 and cluster 9, in order to compute the mean value.
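The three linkage updates above reduce to a single function of the already-stored sub-cluster distances plus, for average linkage, the in-memory atom counts. A sketch (the function name is hypothetical):

```python
def merged_distance(d_ac, d_bc, count_a, count_b, method):
    """Distance between cluster C and the new cluster A∪B, computed only
    from the stored distances d(A,C) and d(B,C), as in Section 3.
    count_a / count_b are the point counts kept in memory."""
    if method == "single":
        return min(d_ac, d_bc)
    if method == "complete":
        return max(d_ac, d_bc)
    if method == "average":
        # weighted mean: every point pair contributes equally
        return (count_a * d_ac + count_b * d_bc) / (count_a + count_b)
    raise ValueError(f"unknown linkage method: {method}")

# Example in the spirit of dis(7,9): cluster 9 = cluster 1 ∪ cluster 8,
# with count(1)=1, count(8)=2, dis(1,7)=2.0, dis(8,7)=5.0.
assert merged_distance(2.0, 5.0, 1, 2, "single") == 2.0
assert merged_distance(2.0, 5.0, 1, 2, "complete") == 5.0
assert merged_distance(2.0, 5.0, 1, 2, "average") == 4.0
```

This is why each merge only needs two rows per remaining cluster from HBase or the cache, rather than the full point-pair distance set.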
4. design and the realization of the multithreading algorithm of cache are combined
Besides an efficient access method for the distance matrix, a parallelized algorithm is needed to complete the clustering process. In this algorithm, multithreading and caching are combined in the design and implementation. By the principle of hierarchical clustering, the first step is to obtain the two closest clusters and merge them into a new cluster. As introduced above, the sortedDistance table stores cluster pairs and the distances between them, and most importantly they are sorted by distance from smallest to largest, so only the first record of the sortedDistance table needs to be fetched. The HBase Scan API is used to obtain the first record from the sortedDistance table, and the scan cache is set to 1 to avoid fetching unnecessary data.
For ease of explanation, assume the initial data set has 10 points; initially each point is a separate cluster, giving 10 clusters C1 to C10. Suppose the two closest clusters in the first iteration are C1 and C2, which are merged into a new cluster C11. As explained in the previous subsection, the distances between C1, C2 and clusters C3 to C10 are needed to compute the distances between C11 and C3...C10, and the new distances are written into the cache or HBase. In the case of a cache miss, the distances must be read from HBase. Because this is an I/O-intensive operation, multiple threads concurrently read the related distances from the distanceMatrix table with the Scan API, and scan.setCaching is used to tune the number of rows each scan returns per round trip. While distances are being read from the distanceMatrix table, multiple threads are also started concurrently to delete from the sortedDistance table the distances related to the two merged clusters. Since get and put are both I/O-intensive operations, the data already obtained can be processed ahead of time instead of waiting for all data acquisition to finish: while the scan threads read data from HBase, computation threads start computing rather than waiting for the scan threads to finish, and further threads concurrently write the new distances back to the cache or HBase, so that by the time all data has been fetched the computation is essentially complete.
The execution order of the parallel threads is shown in Fig. 4. Clearly, some synchronization techniques are needed to control the threads. The algorithm uses BlockingQueue and Barrier. A BlockingQueue controls two threads that alternately put elements into and take elements out of it, and is usually used to control inter-thread communication. Barrier is highly useful in parallel iterative algorithms: a problem is split into multiple subproblems executed in parallel, and every thread that reaches the barrier waits until all threads have reached it.
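A toy sketch of this scan/compute pipeline, using Python's `queue.Queue` as the blocking queue and `threading.Barrier` for the synchronization point; the row tuples and `merge_fn` stand in for the HBase scan results and the distance computation, and all names are illustrative:

```python
import queue
import threading

def pipelined_merge(rows, merge_fn, workers=2):
    """Scan thread feeds fetched distances into a blocking queue while
    compute threads consume them immediately, instead of waiting for the
    full scan to finish (Section 4's overlap of I/O and computation)."""
    q = queue.Queue(maxsize=4)             # BlockingQueue between scan and compute
    results = []
    lock = threading.Lock()
    barrier = threading.Barrier(workers + 1)  # scan thread + compute threads

    def scan():
        for row in rows:                   # stands in for the HBase Scan
            q.put(row)
        for _ in range(workers):
            q.put(None)                    # one end-of-stream marker per consumer
        barrier.wait()                     # meet the compute threads at the barrier

    def compute():
        while (row := q.get()) is not None:
            with lock:
                results.append(merge_fn(row))
        barrier.wait()

    threads = [threading.Thread(target=scan)]
    threads += [threading.Thread(target=compute) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Two fetched distance pairs; merge_fn mimics a single-linkage min.
merged = pipelined_merge([(1.0, 3.0), (2.0, 8.0)], lambda r: min(r))
assert sorted(merged) == [1.0, 2.0]
```

The end-of-stream markers and the barrier together guarantee that no compute thread exits before the scan finishes, mirroring how the patent's threads synchronize before the next iteration.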
5. adjustable parameter explanation
Hadoop and HBase have many parameters that must be adjusted to the actual application to achieve reasonably good performance. In the design of the algorithm, besides the parameters Hadoop and HBase support themselves, many custom parameters are reserved so the algorithm can conveniently be adjusted to actual conditions; the main parameters are shown in Fig. 5 and introduced below.
The distance matrix produced during hierarchical clustering is rather large, so it is placed in HBase tables with a layer of cache added on the client. To make full use of the machines of the HBase cluster, the reasonable approach is to spread the load across them, so the tables are pre-partitioned at creation time: a table is split into multiple regions, and each server is responsible for part of them. The regionCountDM and regionCountSD parameters specify how many regions the distanceMatrix table and the sortedDistance table are split into, respectively. Meanwhile, a layer of cache is maintained on the client to improve the performance of data lookups, and the cacheSize parameter conveniently adjusts the number of records the cache holds. Corresponding to the single-linkage, complete-linkage, and average-linkage methods of hierarchical clustering, the similarity_method parameter specifies which method to use. The distance_method parameter specifies which distance measure is used between two points; in testing only the Euclidean measure was implemented. Because the computation uses multithreading, parameters are also provided to adjust the thread counts: the putThreadNum parameter adjusts the number of put threads, while the pagesNum parameter controls the number of threads reading data from HBase. The data to be read from HBase is paged, each thread reading one page, and pagesNum controls the number of pages; for example, with pagesNum=10 and IDs 0 to 10000 to read, the work is divided into 10 threads reading 0 to 1000, 1001 to 2000, and so on. Finally, to control the termination condition of the hierarchical clustering, the maxClusterNum parameter specifies into how many classes the data is aggregated, or the minDistance parameter terminates clustering when the minimum distance between two clusters falls below the specified value.
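The pagesNum paging described above can be sketched as a simple range splitter (an illustration, not the patent's code; ceiling division and the inclusive-range convention are assumptions):

```python
def page_ranges(first_id, last_id, pages_num):
    """Split the inclusive ID range [first_id, last_id] into pages_num
    contiguous pages; each scan thread reads one page, as the pagesNum
    parameter controls in Section 5."""
    total = last_id - first_id + 1
    size = -(-total // pages_num)   # ceiling division
    ranges = []
    start = first_id
    while start <= last_id:
        end = min(start + size - 1, last_id)
        ranges.append((start, end))
        start = end + 1
    return ranges

# The patent's example: pagesNum=10 over IDs 0..10000.
r = page_ranges(0, 10000, 10)
assert len(r) == 10
assert r[0] == (0, 1000)        # first thread reads IDs 0 to 1000
assert r[-1] == (9009, 10000)   # last page absorbs the remainder
```

Each returned (start, end) pair would become the start and stop row of one Scan, so the threads cover the full ID range without overlap.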
Claims (5)
1. A hierarchical clustering method based on Hadoop and HBase, characterized in that the distance matrix computation is parallelized with Hadoop and the distance matrix is stored in HBase; the table design uses RowKey design to make full use of HBase's sorting capability; and the distance matrix in HBase is processed with multithreading and a caching technique, thereby realizing a scalable hierarchical clustering method applicable to big data.
2. The hierarchical clustering method based on Hadoop and HBase according to claim 1, characterized in that the parallelized computation of the distance matrix specifically comprises: realization with Hadoop, largely divided into two MapReduce algorithms, wherein the first MapReduce algorithm computes the distances and stores the result in an intermediate file, and the second MapReduce algorithm converts the intermediate file into HFile format; finally the result is imported into HBase with the Bulk Load method.
3. The hierarchical clustering method based on Hadoop and HBase according to claim 1, characterized in that HBase mainly holds two tables, wherein the RowKey of one table is the Cluster ID pair, with the Cluster IDs padded with leading zeros to a uniform length, so the records in the table are sorted by Cluster ID, and the value is the distance between the Clusters, whereby the distances related to a specified Cluster ID are obtained rapidly through this table; the RowKey of the other table is the distance between Clusters, padded with leading zeros, so the records in the table are sorted by distance from smallest to largest, and the value is the Cluster ID pair, whereby the two closest Clusters are obtained rapidly by fetching the first row of this table;
in the starting stage of the clustering method, the two tables are pre-partitioned by Cluster ID and by distance in advance, and the partitions are spread across the nodes of the HBase cluster.
4. The hierarchical clustering method based on Hadoop and HBase according to claim 1, characterized in that: during clustering, the merge path of each Cluster and the number of atomic Clusters each Cluster contains are recorded in memory; with the single-linkage or complete-linkage algorithm, computing the distance between a newly merged Cluster and an existing Cluster only requires fetching, from HBase or the cache, the distances between the two Clusters that formed the new Cluster and the remaining Clusters; with the average-linkage algorithm, besides the distances between the two Clusters forming the new Cluster and the other Clusters, the number of atomic Clusters contained in each Cluster is also obtained from memory to compute the mean value.
5. The hierarchical clustering method based on Hadoop and HBase according to claim 1, characterized in that, combining a cache with a multithreaded algorithm, the two closest Clusters are obtained from HBase and merged into a new Cluster; the distance information between the two closest Clusters and the other Clusters is fetched concurrently from the cache or HBase; during this time a deletion thread removes stale data from HBase while computation threads start early, computing the distance between each Cluster whose distance information has been obtained and the new Cluster, and the new distance information is written back to the cache or HBase in parallel.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610851970.1A CN106484818B (en) | 2016-09-26 | 2016-09-26 | Hierarchical clustering method based on Hadoop and HBase |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484818A true CN106484818A (en) | 2017-03-08 |
CN106484818B CN106484818B (en) | 2023-04-28 |
Family
ID=58268853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610851970.1A Active CN106484818B (en) | 2016-09-26 | 2016-09-26 | Hierarchical clustering method based on Hadoop and HBase |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484818B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106932184A (en) * | 2017-03-15 | 2017-07-07 | 国网四川省电力公司广安供电公司 | A kind of Diagnosis Method of Transformer Faults based on improvement hierarchical clustering |
CN112668622A (en) * | 2020-12-22 | 2021-04-16 | 中国矿业大学(北京) | Analysis method and analysis and calculation device for coal geological composition data |
CN113268333A (en) * | 2021-06-21 | 2021-08-17 | 成都深思科技有限公司 | Hierarchical clustering algorithm optimization method based on multi-core calculation |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104965823A (en) * | 2015-07-30 | 2015-10-07 | 成都鼎智汇科技有限公司 | Big data based opinion extraction method |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104965823A (en) * | 2015-07-30 | 2015-10-07 | 成都鼎智汇科技有限公司 | Big data based opinion extraction method |
Non-Patent Citations (1)
Title |
---|
Xu Xiaolong; Li Yongping: "A knowledge clustering and statistics mechanism based on MapReduce" (一种基于MapReduce的知识聚类与统计机制) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106932184A (en) * | 2017-03-15 | 2017-07-07 | 国网四川省电力公司广安供电公司 | A kind of Diagnosis Method of Transformer Faults based on improvement hierarchical clustering |
CN112668622A (en) * | 2020-12-22 | 2021-04-16 | 中国矿业大学(北京) | Analysis method and analysis and calculation device for coal geological composition data |
CN113268333A (en) * | 2021-06-21 | 2021-08-17 | 成都深思科技有限公司 | Hierarchical clustering algorithm optimization method based on multi-core calculation |
CN113268333B (en) * | 2021-06-21 | 2024-03-19 | 成都锋卫科技有限公司 | Hierarchical clustering algorithm optimization method based on multi-core computing |
Also Published As
Publication number | Publication date |
---|---|
CN106484818B (en) | 2023-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10789231B2 (en) | Spatial indexing for distributed storage using local indexes | |
Liao et al. | Multi-dimensional index on hadoop distributed file system | |
CN104899297B (en) | Create the method with the hybrid index of storage perception | |
CN104407879B (en) | A kind of power network sequential big data loaded in parallel method | |
US20110252033A1 (en) | System and method for multithreaded text indexing for next generation multi-core architectures | |
WO2012060889A1 (en) | Systems and methods for grouped request execution | |
US7890480B2 (en) | Processing of deterministic user-defined functions using multiple corresponding hash tables | |
CN103745008A (en) | Sorting method for big data indexing | |
CN104376109B (en) | A kind of multi-dimensional data location mode based on data distribution library | |
CN107918642A (en) | Data query method, server and computer-readable recording medium | |
CN106484818A (en) | A kind of hierarchy clustering method based on Hadoop and HBase | |
CN103207889A (en) | Method for retrieving massive face images based on Hadoop | |
CN107203330A (en) | A kind of flash data location mode towards read-write data flow | |
CN111857582B (en) | Key value storage system | |
Ribeiro-Junior et al. | Fast parallel set similarity joins on many-core architectures | |
US20060149766A1 (en) | Method and an apparatus to improve processor utilization in data mining | |
CN109460406A (en) | Data processing method and device | |
CN104765782B (en) | A kind of index order update method and device | |
CN102521304A (en) | Hash based clustered table storage method | |
CN110008030A (en) | A kind of method of metadata access, system and equipment | |
CN108052535A (en) | The parallel fast matching method of visual signature and system based on multi processor platform | |
CN104268146A (en) | Static B+-tree index method suitable for analytic applications | |
CN106294526B (en) | A kind of mass small documents moving method in hierarchical stor | |
CN109977334B (en) | Search speed optimization method | |
Kim et al. | A performance study of traversing spatial indexing structures in parallel on GPU |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |