CN115658809A - Data distributed clustering method and device based on local direction centrality

Data distributed clustering method and device based on local direction centrality

Info

Publication number
CN115658809A
Authority
CN
China
Prior art keywords
data
cluster
point
partition
clustering
Prior art date
Legal status
Pending
Application number
CN202211265216.1A
Other languages
Chinese (zh)
Inventor
桂志鹏
黄子晨
彭德华
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Application filed by Wuhan University (WHU)
Priority to CN202211265216.1A
Publication of CN115658809A

Classifications

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data distributed clustering method and device based on local direction centrality, wherein the method comprises the following steps: S1, submitting the parameters required by the algorithm task in a distributed cluster environment, and reading the data to be clustered; S2, constructing a global priority-search K-means tree index over the complete data, and sharing the index variable with all working nodes of the cluster; S3, dividing the complete data by a method combining data sampling with Hilbert curve partitioning; S4, executing CDC local clustering in parallel on each working node; S5, merging local clusters across partitions according to their maximum reachable distances to generate complete clusters; and S6, outputting the clustering result to a distributed file system. The method optimizes and accelerates the CDC clustering algorithm in a distributed setting from the two aspects of algorithm flow optimization and parallel processing optimization, aiming to improve the computational efficiency of the CDC algorithm and to provide a feasible optimization scheme for applying the algorithm to massive data mining and machine learning tasks.

Description

Data distributed clustering method and device based on local direction centrality
Technical Field
The invention relates to the technical field of big data mining, in particular to a data distributed clustering method and device based on local direction centrality.
Background
In recent years, extensive research on clustering algorithms has provided effective solutions to problems such as arbitrary-shape cluster identification, outlier detection and high-dimensional data processing, but density heterogeneity and weak connectivity of the data distribution remain open problems in the application scenarios of cluster analysis. Since an interior point of a cluster tends to be surrounded by its neighbor points in all directions, while a boundary point has neighbor points only within a certain range of directions, interior points and boundary points can be divided according to the difference in the distribution of neighbor directions. Accordingly, the local direction centrality clustering algorithm CDC measures the directional uniformity of the K nearest neighbors (KNN) of a data point by establishing a local Direction Centrality Measure (DCM), and divides cluster interior points and boundary points in a density-independent manner; at the same time, the boundary points are used to constrain the connection of interior points, which avoids cross-cluster connections and achieves the separation of weakly connected clusters, providing an effective scheme for the above problems. The accuracy of the algorithm has been verified on artificial and real data sets, but its neighbor search has O(n²)-level time complexity, so its efficiency drops markedly as the data scale grows, to the point that a single machine may be unable to complete the computation; it cannot keep up with today's exponentially growing data scale. For the above problems, besides improving the flow to reduce the time complexity of the algorithm itself, the computing efficiency of the clustering algorithm can be improved from the perspective of parallel computing.
Parallelization has become a hot spot in performance optimization of clustering algorithms. Common distributed computing frameworks include Hadoop, Spark, Flink and the like, among which Spark is a new-generation big data parallel processing platform with the advantages of simplicity, ease of use, rich functionality and automatic fault tolerance. Compared with the classic big data parallel processing platform Hadoop, Spark's memory-based data management makes it better suited to clustering algorithms that require multiple rounds of iteration. Several Spark-based parallelization schemes for clustering algorithms have been proposed in the prior art. By designing the algorithm as three stages (data partitioning, distributed local clustering and global merging), this research improves the efficiency of big data clustering to a certain extent; however, Spark's default partitioning strategy ignores the spatial proximity of the data and easily causes an unbalanced partition data load. When the data in the partitions are skewed, the workload of the nodes in the Shuffle stage becomes unbalanced, that is, the amount of data processed by each node in the cluster differs greatly and execution times are inconsistent, which reduces the utilization of cluster resources and the computing efficiency of the distributed algorithm.
Therefore, the technical problems of low calculation efficiency and poor partitioning effect exist in the prior art.
Disclosure of Invention
The invention provides a data distributed clustering method and device based on local direction centrality, which are used for solving, or at least partially solving, the technical problems of low computing efficiency and poor partitioning effect in the prior art.
In order to solve the above technical problem, a first aspect of the present invention provides a data distributed clustering method based on local direction centrality, including:
S1: receiving the parameters required by the clustering task, including environment parameters, clustering algorithm parameters, partition parameters and neighbor search parameters, configuring and registering a serializer, and reading the complete data to be clustered from the distributed file system;
s2: constructing a global index of a priority search K-means tree based on the read complete data to be clustered, and sharing the global index to each working node through a main node of a distributed cluster;
s3: partitioning the complete data to be clustered by combining a data sampling and Hilbert curve partitioning method, obtaining a corresponding partition ID, and sending partition data corresponding to the partition ID to a corresponding working node through a main node of the distributed cluster;
S4: each working node of the distributed cluster executes the CDC local clustering algorithm in parallel, which specifically comprises: each working node performs k-nearest-neighbor search on its partition data through the shared global index and calculates the DCM value of each point, divides interior points and boundary points according to the relation between the DCM value and a DCM threshold, merges interior points based on the reachable distance from an interior point to the boundary points, assigns the merged interior points to the same interior-point cluster and marks the interior-point cluster ID, and searches for the interior point nearest to each boundary point and marks the boundary-point cluster ID, obtaining local clusters, wherein the DCM value is the variance of the angles formed by a data point and its k neighbor points in two-dimensional space;
S5: the master node of the distributed cluster merges local clusters across partitions according to their maximum reachable distances to generate complete clusters as the clustering result;
s6: and outputting the clustering result to a distributed file system.
In one embodiment, step S1 comprises:
s1.1: the distributed cluster receives parameters required by a clustering task, wherein the environment parameters comprise file paths, the clustering algorithm parameters comprise neighbor numbers and boundary point proportions, the partition parameters comprise partition types, partition sampling rate proportions and partition numbers, and the neighbor search parameters comprise index type parameters, construction parameters and search parameters;
S1.2: registering a serializer for the geometry-type objects and the index (a configuration sketch follows this list);
s1.3: and reading the complete data to be clustered in the distributed file system according to the file path, and performing projection conversion.
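As an illustration of step S1, the following is a minimal configuration sketch, assuming Apache Spark with Kryo serialization; the Point case class and the PSKMeansTree placeholder are hypothetical stand-ins for the geometry-type objects and the index type mentioned above, and the HDFS path is illustrative only.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Point(x: Double, y: Double)     // geometry-type object
class PSKMeansTree extends Serializable    // placeholder for the index type

object Init {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DistributedCDC")
      // use Kryo instead of default Java serialization for compact shuffles
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Point], classOf[PSKMeansTree]))
    val spark = SparkSession.builder.config(conf).getOrCreate()
    // S1.3: read the complete data to be clustered from the distributed file system
    val points = spark.sparkContext
      .textFile("hdfs:///input/points.csv")
      .map { line => val a = line.split(","); Point(a(0).toDouble, a(1).toDouble) }
  }
}
```

Registering the concrete classes with Kryo avoids falling back to generic serialization when geometry objects and the index are broadcast or shuffled.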
In one embodiment, step S2 comprises:
S2.1: initializing the index structure according to the index type parameter and the construction parameters, wherein the construction parameters comprise the branching factor branch, the maximum number of K-means iterations I_max and the initial centroid selection method C_alg;
S2.2: calculating the centroid of the complete data to be clustered, and constructing the root node of the index tree;
S2.3: selecting branch initial partition centroids according to C_alg, and assigning each data point to the nearest partition;
S2.4: updating the partition centroids and re-assigning the data until the partition centroids no longer change or the number of updates reaches I_max;
S2.5: constructing a node for each partition centroid, and adding it to the child node set of its parent node;
S2.6: repeating steps S2.3 to S2.5 until the number of data points in a partition is less than branch, obtaining the constructed global priority-search K-means tree index and representing it as a variable (a construction sketch follows this list);
S2.7: distributing the global priority-search K-means tree index variable to each working node through the master node of the distributed cluster.
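The construction loop of steps S2.2 to S2.6 can be sketched in plain Scala as follows, under simplifying assumptions: random initial centroids stand in for C_alg, distances are Euclidean, and the Point case class from the previous sketch is reused. The real index additionally keeps the priority-queue bookkeeping used at query time.

```scala
import scala.util.Random

sealed trait Node extends Serializable
case class Leaf(points: Vector[Point]) extends Node
case class Inner(centroid: Point, children: Vector[Node]) extends Node

def centroidOf(ps: Vector[Point]): Point =
  Point(ps.map(_.x).sum / ps.size, ps.map(_.y).sum / ps.size)

def dist2(a: Point, b: Point): Double = {
  val dx = a.x - b.x; val dy = a.y - b.y; dx * dx + dy * dy
}

// S2.2-S2.6: split the data into `branch` K-means partitions and recurse
// until a partition holds fewer than `branch` points.
def build(ps: Vector[Point], branch: Int, iMax: Int): Node =
  if (ps.size < branch) Leaf(ps)                            // S2.6 stop condition
  else {
    var cents = Random.shuffle(ps).take(branch).toVector    // S2.3 (random C_alg)
    var groups = Map.empty[Int, Vector[Point]]
    var it = 0; var changed = true
    while (changed && it < iMax) {                          // S2.4
      groups = ps.groupBy(p => cents.indices.minBy(i => dist2(p, cents(i))))
      val next = cents.indices.toVector
        .map(i => groups.get(i).map(centroidOf).getOrElse(cents(i)))
      changed = next != cents; cents = next; it += 1
    }
    if (groups.size <= 1) Leaf(ps)                          // degenerate split guard
    else Inner(centroidOf(ps),                              // S2.5: one child per partition
      groups.valuesIterator.map(g => build(g, branch, iMax)).toVector)
  }
```

Recursion stops when a partition holds fewer than branch points (S2.6), so each leaf stays small enough for a linear scan during k-nearest-neighbor search.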
In one embodiment, step S3 comprises:
s3.1: carrying out data sampling on the complete data to be clustered according to the partition sampling rate proportion;
s3.2: calculating Hilbert coding values of the sampling data and sequencing sampling points according to the values;
S3.3: evenly dividing the sorted sampling points into as many intervals as there are partitions, and recording the division positions as partition boundaries;
S3.4: expanding the sampling-point intervals into rectangular partition ranges, and generating the partition ID of every data point;
S3.5: distributing the corresponding partition data to each working node of the cluster according to the partition ID (a Hilbert coding sketch follows this list).
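A sketch of the coding and splitting in steps S3.2 to S3.4 follows; it assumes coordinates have already been scaled to integer cells on a 2^order × 2^order grid, and uses the classic xy-to-curve-distance conversion for the Hilbert curve.

```scala
// S3.2: Hilbert code of an integer grid cell (x0, y0) on a 2^order grid.
def hilbertIndex(order: Int, x0: Int, y0: Int): Long = {
  val n = 1 << order
  var (x, y) = (x0, y0)
  var d = 0L
  var s = n / 2
  while (s > 0) {
    val rx = if ((x & s) > 0) 1 else 0
    val ry = if ((y & s) > 0) 1 else 0
    d += s.toLong * s * ((3 * rx) ^ ry)
    if (ry == 0) {                       // rotate/flip the quadrant
      if (rx == 1) { x = n - 1 - x; y = n - 1 - y }
      val t = x; x = y; y = t
    }
    s /= 2
  }
  d
}

// S3.3: cut the sorted sample codes into equal-size intervals; the cut
// positions become the partition boundaries for the full data set.
def splitPositions(sortedCodes: Vector[Long], numPartitions: Int): Vector[Long] =
  (1 until numPartitions).toVector
    .map(i => sortedCodes((i * sortedCodes.size) / numPartitions))

// S3.4: partition ID of any data point from its Hilbert code.
def partitionId(code: Long, bounds: Vector[Long]): Int =
  bounds.indexWhere(code < _) match { case -1 => bounds.size; case i => i }
```

Because the cut positions come from the sorted sample, each interval receives roughly the same number of points, which is what balances the partition load.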
In one embodiment, step S4 comprises:
S4.1: performing k-nearest-neighbor search on the data to be clustered in each partition based on the global priority-search K-means tree index structure, obtaining the k neighbor points of each data point;
s4.2: calculating an angle variance DCM formed by each data point and k neighbor points thereof in a two-dimensional space;
$$\mathrm{DCM}=\frac{1}{k}\sum_{i=1}^{k}\left(\theta_i-\frac{2\pi}{k}\right)^{2}$$
wherein k is the number of neighbor points of the data point, and $(\theta_1,\theta_2,\ldots,\theta_k)$ are the angles formed between the data point and its adjacent neighbor points, satisfying $\sum_{i=1}^{k}\theta_i=2\pi$;
S4.3: merging and sorting the DCM values of all data points in the partitions, and calculating the threshold T_DCM according to the boundary point Ratio parameter; if DCM < T_DCM, the data point is marked as an interior point, otherwise it is marked as a boundary point;
S4.4: calculating the minimum distance between an interior point p_i and all boundary points q_m as the reachable distance r_i of the interior point p_i, i.e. r_i = min(d(p_i, q_m)), wherein p_i is an interior point, q_m is a boundary point, and d(p_i, q_m) is the distance between the interior point p_i and the boundary point q_m;
S4.5: merging interior points according to the connection rule that the distance between two interior points is not greater than the sum of their reachable distances, i.e. d(p_i, p_j) ≤ r_i + r_j, and marking the interior-point cluster IDs, wherein r_i and r_j are respectively the reachable distances of the interior points p_i and p_j, and d(p_i, p_j) is the distance between them;
S4.6: searching for the interior point nearest to each boundary point, and taking the cluster ID of the found interior point as the boundary point's cluster ID, obtaining the local clusters (a local clustering sketch follows this list).
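The computations in steps S4.2 to S4.5 can be sketched in plain Scala as follows; this is a minimal sketch in which the k-nearest-neighbor results are assumed to be given, Point reuses the earlier definition, and a union-find propagates the transitive merging of connected interior points.

```scala
import scala.math.{atan2, Pi, sqrt}

def dist(a: Point, b: Point): Double =
  sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y))

// S4.2: DCM is the variance of the k angles between successive neighbor
// directions around p; the angles always sum to 2π.
def dcm(p: Point, neighbors: Seq[Point]): Double = {
  val dirs = neighbors.map(q => atan2(q.y - p.y, q.x - p.x)).sorted
  val angles = dirs.indices.map { i =>
    if (i < dirs.size - 1) dirs(i + 1) - dirs(i)
    else dirs.head + 2 * Pi - dirs.last            // wrap-around angle
  }
  val mean = 2 * Pi / angles.size
  angles.map(a => (a - mean) * (a - mean)).sum / angles.size
}

// S4.3: T_DCM as the (1 - Ratio) quantile of the sorted DCM values, so that
// roughly a Ratio fraction of points ends up marked as boundary points.
def dcmThreshold(dcms: Vector[Double], ratio: Double): Double = {
  val sorted = dcms.sorted
  sorted(((1 - ratio) * (sorted.size - 1)).toInt)
}

// Minimal union-find used to propagate interior-point connections (S4.5).
final class UnionFind(n: Int) {
  private val parent = Array.tabulate(n)(identity)
  def find(i: Int): Int = { if (parent(i) != i) parent(i) = find(parent(i)); parent(i) }
  def union(i: Int, j: Int): Unit = parent(find(i)) = find(j)
}

// S4.4-S4.5: reachable distances, then merge interior points whenever
// d(p_i, p_j) <= r_i + r_j; returns a local cluster label per interior point.
def mergeInterior(interior: Vector[Point], boundary: Vector[Point]): Array[Int] = {
  val r = interior.map(p => boundary.map(q => dist(p, q)).min)
  val uf = new UnionFind(interior.size)
  for (i <- interior.indices; j <- i + 1 until interior.size
       if dist(interior(i), interior(j)) <= r(i) + r(j)) uf.union(i, j)
  Array.tabulate(interior.size)(uf.find)
}
```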
In one embodiment, step S5 comprises:
for local cluster C α The reachable distance of each internal point is sequenced to obtain the maximum reachable distance R α =max(r i );
Performing inter-partition cluster merging according to the maximum reachable distance and the connection rule that the distance between the two clusters is not more than the sum of the reachable distances of the two clusters, namely D (C) a ,C β )≤R α +R β And updating the class cluster ID to generate a complete class cluster, wherein C a 、C β Are two different local clusters, R α 、R β Are respectively local cluster class C a 、C β The maximum reachable distance of D (C) a ,C β ) Is C a 、C β The distance between the inner points of the maximum reachable distance is taken.
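A sketch of this merging step, reusing Point, dist and UnionFind from the sketches above; the assumed cluster summary (interior points plus their reachable distances) is one possible representation, not necessarily the patent's own.

```scala
// Assumed summary of a local cluster produced by the local stage.
case class LocalCluster(id: String, pts: Vector[Point], reach: Vector[Double]) {
  val maxReach: Double = reach.max                    // R_alpha = max(r_i)
  val anchor: Point = pts(reach.indexOf(maxReach))    // point attaining R_alpha
}

// Merge clusters whenever D(C_alpha, C_beta) <= R_alpha + R_beta and return
// the map from each local cluster ID to its merged (complete) cluster ID.
def globalMerge(cs: Vector[LocalCluster]): Map[String, String] = {
  val uf = new UnionFind(cs.size)
  for (i <- cs.indices; j <- i + 1 until cs.size
       if dist(cs(i).anchor, cs(j).anchor) <= cs(i).maxReach + cs(j).maxReach)
    uf.union(i, j)
  cs.indices.map(i => cs(i).id -> cs(uf.find(i)).id).toMap
}
```

Only these small per-cluster summaries travel back to the master node, so the global merge stays cheap relative to the local clustering.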
In one embodiment, after step S6, the method further comprises: and outputting the clustering evaluation result and the calculation time result to a distributed file system.
Based on the same inventive concept, a second aspect of the present invention provides a data distributed clustering apparatus based on local direction centrality, including:
the initialization module is used for receiving the parameters required by the clustering task, including environment parameters, clustering algorithm parameters, partition parameters and neighbor search parameters, configuring and registering a serializer, and reading the complete data to be clustered from the distributed file system;
the global index building module is used for building a global index of a priority search K-means tree based on the read complete data to be clustered, and sharing the global index to each working node through a main node of the distributed cluster;
the data partitioning module is used for partitioning the complete data to be clustered by combining a data sampling and Hilbert curve partitioning method, obtaining a corresponding partition ID, and sending partition data corresponding to the partition ID to corresponding working nodes through main nodes of the distributed cluster;
the local clustering module is used for executing the CDC local clustering algorithm in parallel through each working node of the distributed cluster, which specifically comprises: performing k-nearest-neighbor search on the partition data using the global index shared by the master node and calculating the DCM values, dividing interior points and boundary points according to the relation between the DCM value and a DCM threshold, merging interior points based on the reachable distance from an interior point to the boundary points, assigning the merged interior points to the same interior-point cluster and marking the interior-point cluster ID, and searching for the interior point nearest to each boundary point and marking the boundary-point cluster ID, obtaining local clusters, wherein the DCM value is the variance of the angles formed by a data point and its k neighbor points in two-dimensional space;
the global merging module is used for merging the local clusters across partitions through the master node of the distributed cluster according to their maximum reachable distances to generate complete clusters as the clustering result;
and the result output module is used for outputting the clustering result to the distributed file system.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
Compared with the prior art, the invention has the advantages and beneficial technical effects as follows:
the invention provides a data distributed clustering method based on local direction centrality, which realizes data clustering in a distributed cluster environment by using a local direction centrality clustering algorithm, improves the efficiency of neighbor search in local clustering based on a priority search K-means tree index, and optimizes data partitioning in a parallel processing flow based on a data sampling and Hilbert curve partitioning method. The K-means tree index is preferentially searched to convert the distance calculation of the point pair into the neighborhood query of the node to accelerate the search of the neighborhood point, so that the query range can be narrowed to improve the efficiency of local clustering; in addition, the partitioning method combining data sampling and the Hilbert curve considers the partitioning efficiency and the spatial proximity of data distribution, not only is the partitioning speed improved by reducing the data amount of partitioning calculation through data sampling, but also the balanced partitioning is constructed by combining the Hilbert curve with better spatial aggregation characteristics so as to improve the performance of the parallel algorithm. The method can be applied to massive two-dimensional point data, has good parallel acceleration effect and expandability on the premise of keeping clustering precision, is expected to improve the calculability of the algorithm facing the massive data and the utilization rate of the calculation resources of a distributed system, and provides a feasible optimization scheme for various kinds of big data mining and machine learning application of the CDC clustering algorithm.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of an artificial data set provided by an embodiment of the invention;
FIG. 2 is a graph comparing the clustering accuracy of the clustering method of the present invention with that of the prior serial clustering method;
FIG. 3 is a graph comparing the execution times of the clustering method of the present invention and the prior serial clustering method;
FIG. 4 is a flow chart of a Spark-based distributed local direction centrality clustering algorithm in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the algorithm for constructing the priority-search K-means tree according to an embodiment of the present invention;
FIG. 6 is a flow chart of a data partitioning algorithm in an embodiment of the present invention;
FIG. 7 is a schematic diagram of a local clustering algorithm in an embodiment of the present invention;
FIG. 8 is a schematic diagram of querying the priority-search K-means tree in an embodiment of the present invention;
FIG. 9 is a diagram illustrating a partition global merge algorithm according to an embodiment of the present invention;
FIG. 10 is a flowchart of a data distributed clustering method based on local direction centrality according to an embodiment of the present invention;
fig. 11 is a frame diagram of a data distributed clustering apparatus based on local direction centrality according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
fig. 14 is a schematic diagram of a computer distributed architecture according to an embodiment of the present invention.
Detailed Description
The technical problem to be solved by the invention is, for the low computing efficiency of the CDC clustering algorithm on massive data and the data skew of Spark's default partitioning strategy, how to accelerate the neighbor search in the algorithm flow to divide data points more efficiently, and how to design a fast and balanced data partitioning method to improve the overall performance of the parallel distributed system and hence the efficiency of the CDC clustering algorithm.
Specifically, the method takes the CDC clustering algorithm as the research object and a distributed computing framework as the technical support. For the computational intensity and the skewed Spark default partition data that the CDC clustering algorithm faces in massive data scenarios, it establishes a priority-search K-means tree index to accelerate nearest-neighbor search and reduce the computational complexity of the algorithm, and optimizes the spatial partitioning by combining data sampling with the Hilbert curve partitioning method, which alleviates the problems of low partitioning efficiency, skewed partition data and high cross-node communication cost. On this basis, combining the search acceleration and the partition optimization design, a Spark-based two-dimensional parallel CDC clustering algorithm is implemented to meet the needs of application scenarios in which the CDC clustering algorithm processes massive data.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example one
The embodiment of the invention provides a data distributed clustering method based on local direction centrality, which comprises the following steps:
S1: receiving the parameters required by the clustering task, including environment parameters, clustering algorithm parameters, partition parameters and neighbor search parameters, configuring and registering a serializer, and reading the complete data to be clustered from the distributed file system;
s2: constructing a global index of a priority search K-means tree based on the read complete data to be clustered, and sharing the global index to each working node through a main node of a distributed cluster;
s3: partitioning the complete data to be clustered by combining a data sampling and Hilbert curve partitioning method, obtaining a corresponding partition ID, and sending partition data corresponding to the partition ID to a corresponding working node through a main node of the distributed cluster;
S4: each working node of the distributed cluster executes the CDC local clustering algorithm in parallel, which specifically comprises: each working node performs k-nearest-neighbor search on its partition data through the shared global index and calculates the DCM value of each point, divides interior points and boundary points according to the relation between the DCM value and a DCM threshold, merges interior points based on the reachable distance from an interior point to the boundary points, assigns the merged interior points to the same interior-point cluster and marks the interior-point cluster ID, and searches for the interior point nearest to each boundary point and marks the boundary-point cluster ID, obtaining local clusters, wherein the DCM value is the variance of the angles formed by a data point and its k neighbor points in two-dimensional space;
S5: the master node of the distributed cluster merges local clusters across partitions according to their maximum reachable distances to generate complete clusters as the clustering result;
s6: and outputting the clustering result to a distributed file system.
Specifically, the step S2 of building the global index of the priority search K-means tree can accelerate the neighbor search step in the CDC local clustering algorithm in the step S4, so that the calculation efficiency is improved.
Step S3 is a data partitioning stage, firstly, reducing the data amount of partitioning calculation through random sampling, then, uniformly partitioning data through calculating Hilbert coding values of sampling points, and finally, partitioning all data according to the partitioning of the sampling points; and after the partition is completed, distributing the partition data to each working node of the cluster.
Step S4 is a local clustering stage, and the CDC comprises three steps of neighbor searching, data division and internal point connection.
Step S5 is the global merging stage: the local clusters of the partitions are merged according to their maximum reachable distances, and the cluster IDs are updated at the same time to generate complete clusters.
Step S6 is the result output stage: the clustering result is output to the distributed file system.
The scheme provides a distributed optimization and acceleration solution for the CDC clustering algorithm, which has high time complexity and has difficulty processing large-scale data. The scheme aims to obtain a good parallel speed-up effect and scalability while preserving the clustering accuracy of the algorithm, so as to improve the computability of the algorithm on massive data and the utilization of the computing resources of a distributed system, addressing the computability problem of the CDC clustering algorithm in massive data scenarios and the data skew problem of Apache Spark's native data partitioning scheme.
In one embodiment, step S1 comprises:
s1.1: the distributed cluster receives parameters required by a clustering task, wherein environment parameters comprise file paths, clustering algorithm parameters comprise neighbor numbers and boundary point proportions, partition parameters comprise partition types, partition sampling rate proportions and partition numbers, and neighbor search parameters comprise index type parameters, construction parameters and search parameters;
S1.2: registering a serializer for the geometry-type objects and the index;
s1.3: and reading the complete data to be clustered in the distributed file system according to the file path, and performing projection conversion.
In one embodiment, step S2 comprises:
S2.1: initializing the index structure according to the index type parameter and the construction parameters, wherein the construction parameters comprise the branching factor branch, the maximum number of K-means iterations I_max and the initial centroid selection method C_alg;
S2.2: calculating the centroid of the complete data to be clustered, and constructing the root node of the index tree;
S2.3: selecting branch initial partition centroids according to C_alg, and assigning each data point to the nearest partition;
S2.4: updating the partition centroids and re-assigning the data until the partition centroids no longer change or the number of updates reaches I_max;
S2.5: constructing a node for each partition centroid, and adding it to the child node set of its parent node;
S2.6: repeating steps S2.3 to S2.5 until the number of data points in a partition is less than branch, obtaining the constructed global priority-search K-means tree index and representing it as a variable;
S2.7: distributing the global priority-search K-means tree index variable to each working node through the master node of the distributed cluster (a broadcast sketch follows this list).
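A broadcast sketch for S2.7, assuming Spark and the hypothetical build routine from the earlier sketch; allPoints and searchTree are illustrative names, not the patent's own identifiers.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val sc = spark.sparkContext

// S2: build the index once on the driver, then broadcast the read-only tree
// so each task reads the worker-local copy instead of shipping it per closure.
val tree: Node = build(allPoints, branch = 32, iMax = 11)
val treeBc = sc.broadcast(tree)

// later, inside a worker-side task (S4.1), with a hypothetical search helper:
// val knn = searchTree(treeBc.value, queryPoint, k)
```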
In one embodiment, step S3 comprises:
s3.1: carrying out data sampling on the complete data to be clustered according to the partition sampling rate proportion;
s3.2: calculating Hilbert coding values of the sampling data and sequencing sampling points according to the values;
S3.3: evenly dividing the sorted sampling points into as many intervals as there are partitions, and recording the division positions as partition boundaries;
S3.4: expanding the sampling-point intervals into rectangular partition ranges, and generating the partition ID of every data point;
S3.5: distributing the corresponding partition data to each working node of the cluster according to the partition ID (a partitioner sketch follows this list).
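A partitioner sketch for S3.5, assuming a Spark RDD keyed by the partition ID computed from the Hilbert intervals; routing by that key replaces hash partitioning and is what avoids the skew discussed in the background.

```scala
import org.apache.spark.Partitioner

// Every record goes to the worker that owns its Hilbert interval, instead
// of being scattered by a hash of the key.
class HilbertPartitioner(val numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

// usage sketch, with `bounds` produced from the sampled Hilbert codes:
// points.map(p => (partitionId(codeOf(p), bounds), p))
//       .partitionBy(new HilbertPartitioner(bounds.size + 1))
```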
In one embodiment, step S4 comprises:
S4.1: performing k-nearest-neighbor search on the data to be clustered in each partition based on the global priority-search K-means tree index structure, obtaining the k neighbor points of each data point;
s4.2: calculating the angle variance DCM formed by each data point and k neighbor points thereof in a two-dimensional space;
$$\mathrm{DCM}=\frac{1}{k}\sum_{i=1}^{k}\left(\theta_i-\frac{2\pi}{k}\right)^{2}$$
wherein k is the number of neighbor points of the data point, and $(\theta_1,\theta_2,\ldots,\theta_k)$ are the angles formed between the data point and its adjacent neighbor points, satisfying $\sum_{i=1}^{k}\theta_i=2\pi$;
S4.3: merging and sorting the DCM values of all data points in the partitions, and calculating the threshold T_DCM according to the boundary point Ratio parameter; if DCM < T_DCM, the data point is marked as an interior point, otherwise it is marked as a boundary point;
S4.4: calculating the minimum distance between an interior point p_i and all boundary points q_m as the reachable distance r_i of the interior point p_i, i.e. r_i = min(d(p_i, q_m)), wherein p_i is an interior point, q_m is a boundary point, and d(p_i, q_m) is the distance between the interior point p_i and the boundary point q_m;
S4.5: merging interior points according to the connection rule that the distance between two interior points is not greater than the sum of their reachable distances, i.e. d(p_i, p_j) ≤ r_i + r_j, and marking the interior-point cluster IDs, wherein r_i and r_j are respectively the reachable distances of the interior points p_i and p_j, and d(p_i, p_j) is the distance between them;
S4.6: searching for the interior point nearest to each boundary point, and taking the cluster ID of the found interior point as the boundary point's cluster ID, obtaining the local clusters.
In particular, the reachable distance is the minimum distance between an interior point and all boundary points, i.e. the distance from the interior point to its nearest boundary point; p_i and p_j denote distinct interior points.
In one embodiment, step S5 comprises:
for local cluster C α The reachable distance of each internal point in the sequence is obtained, and the maximum reachable distance R is obtained α =max(r i );
Performing inter-partition cluster merging according to the maximum reachable distance and the connection rule that the distance between the two clusters is not more than the sum of the reachable distances of the two clusters, namely D (C) a ,C β )≤R α +R β And updating the class cluster ID to generate a complete class cluster, wherein C a 、C β Are two different local clusters, R α 、R β Are respectively local cluster class C a 、C β Maximum reachable distance of D (C) a ,C β ) Is C a 、C β The distance between the inner points of the maximum reachable distance is taken.
The reachable distances of the two clusters refer to the reachable distances of the interior points within each local cluster; the maximum reachable distance measures the boundary that a local cluster can reach, on the same principle as for an individual interior point.
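As a small worked example with invented numbers: if $R_\alpha = 0.8$, $R_\beta = 0.5$ and the distance between the two interior points attaining these maxima is $D(C_\alpha, C_\beta) = 1.2$, then $1.2 \le 0.8 + 0.5 = 1.3$, so $C_\alpha$ and $C_\beta$ are merged into one complete cluster; at $D(C_\alpha, C_\beta) = 1.4$ they would remain separate.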
In one embodiment, after step S6, the method further comprises: and outputting the clustering evaluation result and the calculation time result to a distributed file system.
Please refer to fig. 10, which is a detailed flowchart of a clustering method according to an embodiment of the present invention.
The invention discloses a distributed optimization and acceleration method for the local direction centrality clustering algorithm (CDC) based on priority-search K-means tree neighbor search and on data partitioning by data sampling and the Hilbert curve, comprising: S1, submitting the parameters required by the algorithm task in a distributed cluster environment, configuring and registering a serializer, and reading the data to be clustered from the distributed file system; S2, constructing a global priority-search K-means tree index over the complete data and sharing the index variable with all working nodes of the cluster; S3, dividing the complete data by the method combining data sampling and Hilbert curve partitioning, and distributing the partition data to the working nodes of the cluster; S4, executing CDC local clustering in parallel on each working node, comprising the three steps of neighbor search, data division and interior point connection: performing k-nearest-neighbor search on the data to be clustered in each partition based on the global index, calculating the local Direction Centrality Measure (DCM) of each data point, dividing interior points and boundary points according to the set DCM threshold parameter, merging interior points based on the reachable distance from interior points to boundary points and marking interior-point cluster IDs, and searching for the interior point nearest to each boundary point and marking boundary-point cluster IDs, obtaining the local clusters; S5, merging clusters across partitions according to the maximum reachable distances of the local clusters and updating the cluster IDs to generate complete clusters; and S6, outputting the clustering result to the distributed file system. For the computability problem of the CDC clustering algorithm in massive data scenarios and the data skew problem of the Apache Spark native data partitioning scheme, the CDC clustering algorithm is optimized and accelerated in a distributed manner from the two angles of algorithm flow optimization and parallel processing optimization. On the algorithm flow side, nearest-neighbor search is optimized by constructing a priority-search K-means tree index, reducing the computational complexity of the CDC clustering algorithm; on the parallel processing side, a Hilbert curve is used to build spatial partitions from sampled data, alleviating the problems of the default partitioning built from the complete data: low construction efficiency, data skew and high communication cost of cross-node data transfer. The method aims to improve the computing efficiency of the CDC algorithm and provides a feasible optimization scheme for applying the algorithm to massive data mining and machine learning tasks.
In order to more clearly illustrate the beneficial effects of the disclosed technical scheme, in this specific embodiment six two-dimensional artificial data sets with different data distributions (shown in FIG. 1) are selected for experiments, and the clustering evaluation indexes of the serial and the parallel optimized algorithm are calculated to verify the clustering accuracy of the invention. The clustering algorithm parameter k is set to the range 5-50 with a step of 5; the Ratio range is set to 0.05-0.4 with a step of 0.05; the number of partitions of the parallel algorithm is set to 4. After the clustering computations for all parameter combinations are finished, the clustering accuracy is calculated, for the serial and for the parallel algorithm respectively, from the clustering result whose evaluation indexes perform best. The experimental results show that the invention maintains clustering accuracy on par with the serial CDC, as shown in FIG. 2.
Nationwide POI data from the AMap (Gaode Map) service are selected, and the data set is divided into scale levels from 100,000 to 1,000,000 points for experiments to verify the parallel speed-up effect of the method. The clustering algorithm parameter k is set to the range 30-50 with a step of 10; the Ratio range is set to 0.05-0.3 with a step of 0.05. The neighbor search parameter K is set to 32 and I_max to 11. The partition sampling rate is set to 1%, and the number of partitions takes the values 16, 32, 64 and 128. To ensure the reliability of the experimental results, the execution time is averaged over the execution times of the different parameter combinations of the algorithm. The experimental results show that the performance of the method improves markedly relative to the serial CDC algorithm, and that the execution time grows approximately linearly with the scale of the data set, from 14.40 seconds at the 100,000 scale to 1091.96 seconds at the 1,000,000 scale, with the growth rate slowing down markedly, as shown in FIG. 3.
Fig. 2 and fig. 3 also compare the original single-machine algorithm with the parallel acceleration algorithm provided by the present application from the perspective of clustering accuracy and execution time on the real data set, verify the feasibility and high efficiency of the method of the present application, and have significant progress.
The implementation process is described below taking Apache Spark as an example; the test machine configuration is a 4-core 8-thread 3.40 GHz CPU with 16 GB of memory, the operating system is Windows, and the computation flow is shown in FIG. 4.
According to the invention, the computability of the algorithm on large-scale data is improved through a distributed optimization and acceleration method of the local direction centrality clustering algorithm, meanwhile, the utilization rate of the algorithm on computing resources of a distributed system and the application potential of the distributed system on spatial data are improved, and various data mining applications are assisted.
The algorithm process of the present invention will be described in detail below with reference to the accompanying drawings, and the specific steps are as follows:
after the environment parameters, the clustering algorithm parameters, the partition parameters and the neighbor search parameters are submitted to the master node, reading data to be clustered stored in the HDFS;
constructing the global priority-search K-means tree index structure over the data to be clustered on the master node, as shown in FIG. 5, and broadcasting it to the Spark cluster;
constructing a space partition for the sampling points by combining the data sampling and Hilbert curve partition methods as shown in FIG. 6, and distributing data to the working nodes according to the partition;
executing the CDC local clustering algorithm in parallel on each working node as shown in FIG. 7, including the steps of neighbor search, data division and interior point connection: in each partition, performing KNN search for the data to be clustered on the global index structure as shown in FIG. 8 and calculating the DCM values, then dividing interior points and boundary points according to the DCM threshold, finally merging interior points based on the reachable distance from interior points to boundary points and marking interior-point cluster IDs, then searching for the interior point nearest to each boundary point and marking boundary-point cluster IDs, obtaining the local clusters;
performing inter-partition cluster merging on the master node according to the maximum reachable distances of the local clusters as shown in FIG. 9, and updating the cluster IDs at the same time to generate complete clusters (a driver-side skeleton of the whole flow follows this list);
and outputting the clustering result to the HDFS.
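A driver-side skeleton tying these stages together is sketched below; it is an illustration under the assumptions of the earlier snippets, with cdcLocal, cellX, cellY, order, bounds, k and ratio as hypothetical names rather than the patent's actual implementation.

```scala
// Driver-side skeleton of the flow above. cdcLocal stands in for the
// per-partition CDC step, cellX/cellY for the coordinate-to-grid-cell
// scaling; order, bounds, k and ratio are assumed to be configured.
val labelled = spark.sparkContext
  .textFile("hdfs:///input/points.csv")                        // read from HDFS
  .map { line => val a = line.split(","); Point(a(0).toDouble, a(1).toDouble) }
  .map(p => (partitionId(hilbertIndex(order, cellX(p), cellY(p)), bounds), p))
  .partitionBy(new HilbertPartitioner(bounds.size + 1))        // balanced split
  .mapPartitionsWithIndex { (pid, it) =>                       // local CDC per node
    val pts = it.map(_._2).toVector
    cdcLocal(pts, treeBc.value, k, ratio)      // assumed: Iterator[(Point, Int)]
      .map { case (pt, local) => (pt, s"$pid-$local") }        // globally unique IDs
  }

// The per-cluster summaries are small, so the S5 merge runs on the driver
// and yields an old-ID -> merged-ID map (idMap); S6 then writes the result:
// labelled.map { case (pt, id) => s"${pt.x},${pt.y},${idMap(id)}" }
//         .saveAsTextFile("hdfs:///output/clusters")
```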
Example two
Based on the same inventive concept, the present embodiment provides a data distributed clustering apparatus based on local direction centrality, including:
the initialization module 1 is used for receiving the parameters required by the clustering task, including environment parameters, clustering algorithm parameters, partition parameters and neighbor search parameters, configuring and registering a serializer, and reading the complete data to be clustered from the distributed file system;
the global index construction module 2 is used for constructing a global index of a priority search K-means tree based on the read complete data to be clustered, and sharing the global index to each working node through a main node of a distributed cluster;
the data partitioning module 3 is used for partitioning the complete data to be clustered by combining a data sampling and Hilbert curve partitioning method, obtaining a corresponding partition ID, and sending partition data corresponding to the partition ID to a corresponding working node through a main node of the distributed cluster;
the local clustering module 4 is used for executing the CDC local clustering algorithm in parallel through each working node of the distributed cluster, which specifically comprises: performing k-nearest-neighbor search on the partition data through the shared global index and calculating the DCM values, dividing interior points and boundary points according to the relation between the DCM value and a DCM threshold, merging interior points based on the reachable distance from an interior point to the boundary points, assigning the merged interior points to the same interior-point cluster and marking the interior-point cluster ID, and searching for the interior point nearest to each boundary point and marking the boundary-point cluster ID, obtaining local clusters, wherein the DCM value is the variance of the angles formed by a data point and its k neighbor points in two-dimensional space;
the global merging module 5 is used for merging the local clusters across partitions through the master node of the distributed cluster according to their maximum reachable distances to generate complete clusters as the clustering result;
and the result output module 6 is used for outputting the clustering result to the distributed file system.
Since the apparatus described in the second embodiment of the present invention is an apparatus used for implementing the data distributed clustering method based on the local direction centrality in the first embodiment of the present invention, as shown in fig. 11, based on the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the apparatus, and thus, no further description is provided herein. All the devices adopted in the method in the first embodiment of the invention belong to the protection scope of the invention.
EXAMPLE III
Based on the same inventive concept, please refer to fig. 12, the present invention further provides a computer readable storage medium 300, on which a computer program 311 is stored, which when executed implements the method as described in the first embodiment.
Since the computer-readable storage medium introduced in the third embodiment of the present invention is a computer-readable storage medium used for implementing the data distributed clustering method based on the local direction centrality in the first embodiment of the present invention, based on the method introduced in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the computer-readable storage medium, and thus, details are not described herein. Any computer readable storage medium used in the method of the first embodiment of the present invention falls within the intended scope of the present invention.
Example four
Based on the same inventive concept, the present application further provides a computer device, as shown in fig. 13, including a memory 401, a processor 402, and a computer program 403 stored on the memory and executable on the processor, where the processor implements the method in the first embodiment when executing the above program.
In a specific implementation process, an implementation framework in the computer-readable storage medium or the computer device of the present invention is a distributed architecture, and specifically, as shown in fig. 14, the distributed architecture is a master-slave distributed structure including a distributed file system, a master node, and a work node.
Since the computer device introduced in the fourth embodiment of the present invention is a computer device used for implementing the data distributed clustering method based on the local direction centrality in the first embodiment of the present invention, based on the method introduced in the first embodiment of the present invention, persons skilled in the art can understand the specific structure and deformation of the computer device, and thus details are not described here. All the computer devices used in the method of the first embodiment of the present invention are within the scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A data distributed clustering method based on local direction centrality is characterized by comprising the following steps:
S1: receiving the parameters required by the clustering task, including environment parameters, clustering algorithm parameters, partition parameters and neighbor search parameters, configuring and registering a serializer, and reading the complete data to be clustered from a distributed file system;
s2: constructing a global index of a priority search K-means tree based on the read complete data to be clustered, and sharing the global index to each working node through a main node of a distributed cluster;
s3: partitioning the complete data to be clustered by combining a data sampling and Hilbert curve partitioning method, obtaining a corresponding partition ID, and sending partition data corresponding to the partition ID to a corresponding working node through a main node of the distributed cluster;
S4: each working node of the distributed cluster executes the CDC local clustering algorithm in parallel, which specifically comprises: each working node performs k-nearest-neighbor search on its partition data through the shared global index and calculates the DCM value of each point, divides interior points and boundary points according to the relation between the DCM value and a DCM threshold, merges interior points based on the reachable distance from an interior point to the boundary points, assigns the merged interior points to the same interior-point cluster and marks the interior-point cluster ID, and searches for the interior point nearest to each boundary point and marks the boundary-point cluster ID, obtaining local clusters, wherein the DCM value is the variance of the angles formed by a data point and its k neighbor points in two-dimensional space;
S5: the master node of the distributed cluster merges local clusters across partitions according to their maximum reachable distances to generate complete clusters as the clustering result;
s6: and outputting the clustering result to a distributed file system.
2. The distributed data clustering method based on the local direction centrality as claimed in claim 1, wherein the step S1 includes:
s1.1: the distributed cluster receives parameters required by a clustering task, wherein environment parameters comprise file paths, clustering algorithm parameters comprise neighbor numbers and boundary point proportions, partition parameters comprise partition types, partition sampling rate proportions and partition numbers, and neighbor search parameters comprise index type parameters, construction parameters and search parameters;
S1.2: registering a serializer for the geometry-type objects and the index;
s1.3: and reading the complete data to be clustered in the distributed file system according to the file path, and performing projection conversion.
3. The distributed data clustering method based on the local direction centrality as claimed in claim 2, wherein the step S2 comprises:
S2.1: initializing the index structure according to the index type parameter and the construction parameters, wherein the construction parameters comprise the branching factor branch, the maximum number of K-means iterations I_max and the initial centroid selection method C_alg;
S2.2: calculating the centroid of the complete data to be clustered, and constructing the root node of the index tree;
S2.3: selecting branch initial partition centroids according to C_alg, and assigning each data point to the nearest partition;
S2.4: updating the partition centroids and re-assigning the data until the partition centroids no longer change or the number of updates reaches I_max;
S2.5: constructing a node for each partition centroid, and adding it to the child node set of its parent node;
S2.6: repeating steps S2.3 to S2.5 until the number of data points in a partition is less than branch, obtaining the constructed global priority-search K-means tree index and representing it as a variable;
S2.7: distributing the global priority-search K-means tree index variable to each working node through the master node of the distributed cluster.
4. The distributed data clustering method based on the local direction centrality as claimed in claim 2, wherein the step S3 comprises:
s3.1: carrying out data sampling on the complete data to be clustered according to the partition sampling rate proportion;
s3.2: calculating Hilbert coding values of the sampling data and sequencing the sampling points according to the values;
S3.3: evenly dividing the sorted sampling points into as many intervals as there are partitions, and recording the division positions as partition boundaries;
S3.4: expanding the sampling-point intervals into rectangular partition ranges, and generating the partition ID of every data point;
s3.5: and distributing the corresponding partition data to each working node of the cluster according to the partition ID.
5. The distributed data clustering method based on the local direction centrality as claimed in claim 2, wherein the step S4 comprises:
S4.1: performing k-nearest-neighbor search on the data to be clustered in each partition based on the global priority-search K-means tree index structure, obtaining the k neighbor points of each data point;
s4.2: calculating an angle variance DCM formed by each data point and k neighbor points thereof in a two-dimensional space;
$$\mathrm{DCM}=\frac{1}{k}\sum_{i=1}^{k}\left(\theta_i-\frac{2\pi}{k}\right)^{2}$$
wherein k is the number of neighbor points of the data point, and $(\theta_1,\theta_2,\ldots,\theta_k)$ are the angles formed between the data point and its adjacent neighbor points, satisfying $\sum_{i=1}^{k}\theta_i=2\pi$;
S4.3: merging and sorting the DCM values of all data points in the partitions, and calculating the threshold T_DCM according to the boundary point proportion parameter Ratio; if DCM < T_DCM, the data point is marked as an interior point, otherwise it is marked as a boundary point;
S4.4: calculating the minimum distance between an interior point p_i and all boundary points q_m as the reachable distance r_i of the interior point p_i, i.e. r_i = min(d(p_i, q_m)), wherein p_i is an interior point, q_m is a boundary point, and d(p_i, q_m) is the distance between the interior point p_i and the boundary point q_m;
S4.5: merging interior points according to the connection rule that the distance between two interior points is not greater than the sum of their reachable distances, i.e. d(p_i, p_j) ≤ r_i + r_j, and marking the interior-point cluster IDs, wherein r_i and r_j are respectively the reachable distances of the interior points p_i and p_j, and d(p_i, p_j) is the distance between them;
S4.6: searching for the interior point nearest to each boundary point, and taking the cluster ID of the found interior point as the boundary point's cluster ID, obtaining the local clusters.
6. The data distributed clustering method based on local direction centrality as claimed in claim 1, wherein the step S5 comprises:
for local cluster C α The reachable distance of each internal point is sequenced to obtain the maximum reachable distance R α =max(r i );
Performing inter-partition cluster merging according to the maximum reachable distance and the connection rule that the distance between the two clusters is not more than the sum of the reachable distances of the two clusters, namely D (C) a ,C β )≤R α +R β And updating the class cluster ID to generate a complete class cluster, wherein C a 、C β Are two different local clusters, R α 、R β Are respectively local cluster class C a 、C β The maximum reachable distance of D (C) a ,C β ) Is C a 、C β The distance between the inner points of the maximum reachable distance is taken.
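A small sketch of this merging rule, assuming each local cluster is summarized by one representative interior point together with its maximum reachable distance R; the single-representative form is a simplification of the claim, which measures D(C_α, C_β) between the points attaining the maximum reachable distances.

```python
import numpy as np

def merge_partitions(local_clusters):
    """`local_clusters` maps cluster ID -> (representative interior point,
    maximum reachable distance R). Returns a mapping old ID -> merged ID."""
    ids = list(local_clusters)
    parent = {c: c for c in ids}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            (pa, ra), (pb, rb) = local_clusters[a], local_clusters[b]
            if np.linalg.norm(np.asarray(pa) - np.asarray(pb)) <= ra + rb:
                parent[find(a)] = find(b)   # connection rule of step S5
    return {c: find(c) for c in ids}
```

On the main node this would run over the collected per-partition cluster summaries, after which the cluster IDs of all data points are rewritten through the returned mapping.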
7. The data distributed clustering method based on local direction centrality as claimed in claim 1, wherein after step S6, the method further comprises: outputting the clustering evaluation results and the computation time results to the distributed file system.
8. A data distributed clustering device based on local direction centrality, characterized by comprising:
the initialization module is used for receiving the parameters required by the clustering task, including environment parameters, clustering algorithm parameters, partition parameters and neighbor search parameters, configuring and registering the serializer, and reading the complete data to be clustered from the distributed file system;
the global index building module is used for building a global index of a priority search K-means tree based on the read complete data to be clustered, and sharing the global index with each working node through the main node of the distributed cluster;
the data partitioning module is used for partitioning the complete data to be clustered by combining a data sampling and Hilbert curve partitioning method, obtaining a corresponding partition ID, and sending partition data corresponding to the partition ID to a corresponding working node through a main node of the distributed cluster;
the local clustering module is used for executing the CDC local clustering algorithm in parallel on each working node of the distributed cluster, and specifically: each working node performs k nearest neighbor search on its partition data through the shared global index and calculates DCM values; divides the data points into interior points and boundary points by comparing each DCM value with the DCM threshold; merges interior points based on their reachable distances to the boundary points, assigning the merged interior points to the same interior-point cluster and marking the interior-point cluster IDs; and searches the interior point nearest to each boundary point to mark the boundary point's cluster ID, obtaining local clusters; wherein the DCM value is the variance of the angles formed by a data point and its k nearest neighbors in two-dimensional space;
the global merging module is used for merging the local clusters across partitions through the main node of the distributed cluster according to the maximum reachable distances of the local clusters, generating complete clusters as the clustering result;
and the result output module is used for outputting the clustering result to the distributed file system.
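The modules of claim 8 map naturally onto a Spark-style driver. The following is a hypothetical orchestration sketch only: it reuses the earlier sketches (build_pskt, hilbert_partition_ids, cdc_local, merge_partitions), and the HDFS paths, configuration keys, and the helpers `summarize` and `combine` are assumptions, not the patented implementation.

```python
import numpy as np
from pyspark import SparkConf, SparkContext

conf = (SparkConf().setAppName("cdc-distributed")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))  # initialization module
sc = SparkContext(conf=conf)

data = np.array(sc.textFile("hdfs:///input/points.csv")            # read complete data to be clustered
                  .map(lambda ln: [float(v) for v in ln.split(",")])
                  .collect())
index_b = sc.broadcast(build_pskt(data))          # global index module: share index with all workers
pids = hilbert_partition_ids(data)                # data partitioning module

def run_partition(it):
    pts = np.array([p for _, p in it])
    labels, reach, interior = cdc_local(pts)      # local clustering module (would query index_b.value)
    yield summarize(pts, labels, reach, interior) # hypothetical per-cluster summary for merging

keyed = (sc.parallelize(list(zip(pids.tolist(), data.tolist())))
           .partitionBy(int(pids.max()) + 1))
summaries = keyed.mapPartitions(run_partition).collect()
mapping = merge_partitions(combine(summaries))    # global merging module on the main node
sc.parallelize(sorted(mapping.items())).saveAsTextFile("hdfs:///output/clusters")  # result output module
```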
9. A computer-readable storage medium, on which a computer program is stored, which program, when executed, carries out the method of any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
CN202211265216.1A 2022-10-17 2022-10-17 Data distributed clustering method and device based on local direction centrality Pending CN115658809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211265216.1A CN115658809A (en) 2022-10-17 2022-10-17 Data distributed clustering method and device based on local direction centrality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211265216.1A CN115658809A (en) 2022-10-17 2022-10-17 Data distributed clustering method and device based on local direction centrality

Publications (1)

Publication Number Publication Date
CN115658809A true CN115658809A (en) 2023-01-31

Family

ID=84988053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211265216.1A Pending CN115658809A (en) 2022-10-17 2022-10-17 Data distributed clustering method and device based on local direction centrality

Country Status (1)

Country Link
CN (1) CN115658809A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786434A (en) * 2023-11-22 2024-03-29 太极计算机股份有限公司 Cluster management method

Similar Documents

Publication Publication Date Title
Ribeiro et al. A survey on subgraph counting: concepts, algorithms, and applications to network motifs and graphlets
Buluç et al. Recent advances in graph partitioning
CN109033340B (en) Spark platform-based point cloud K neighborhood searching method and device
CN111400555B (en) Graph data query task processing method and device, computer equipment and storage medium
CN111597230A (en) Parallel density clustering mining method based on MapReduce
CN115658809A (en) Data distributed clustering method and device based on local direction centrality
Kim et al. Parallel computation of k-nearest neighbor joins using MapReduce
CN109657197B (en) Pre-stack depth migration calculation method and system
CN115563927A (en) Chip wiring method for accelerating construction of minimum right-angle Steiner tree by GPU
CN115358308A (en) Big data instance reduction method and device, electronic equipment and storage medium
Wang et al. Efficient parallel spatial skyline evaluation using mapreduce
CN111274241A (en) Method and apparatus for parallel processing of map data
Ismaeel et al. An efficient workload clustering framework for large-scale data centers
CN109711439A A kind of extensive tourist's representation data clustering method in density peak accelerating neighbor searching using Group algorithm
Huang et al. An efficient algorithm for skyline queries in cloud computing environments
CN116028832A (en) Sample clustering processing method and device, storage medium and electronic equipment
Abdolazimi et al. Connected components of big graphs in fixed mapreduce rounds
CN108717444A (en) A kind of big data clustering method and device based on distributed frame
Rad et al. An intelligent algorithm for mapping of applications on parallel reconfigurable systems
Lieber et al. Scalable high-quality 1D partitioning
CN112669907A (en) Pairing protein interaction network comparison method based on divide-and-conquer integration strategy
Li et al. An accurate and efficient large-scale regression method through best friend clustering
Jin et al. Pattern learning based parallel ant colony optimization
CN114528439A (en) Extremely large group enumeration method and device based on distributed system
Predari Load balancing for parallel coupled simulations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination