CN111275099A

CN111275099A - Clustering method and clustering system based on grid granularity calculation

Info

Publication number: CN111275099A
Application number: CN202010055555.1A
Authority: CN
Inventors: 徐慧; 姚舜宇; 李倩云; 高鳗; 张伟; 陈宏伟; 刘伟; 宗欣露; 苏军; 严灵毓
Original assignee: Hubei University of Technology
Current assignee: Hubei University of Technology
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2020-06-12

Abstract

The invention belongs to the technical field of data processing and discloses a clustering method and a clustering system based on grid granularity calculation, wherein the clustering method based on the grid granularity calculation comprises the steps of reading an original data set; initializing relevant parameters; dividing n-dimensional data into mutually disjoint grids, traversing all the grids and marking the grids as a central grid, an edge grid and a noise grid; and performing density calculation based on granularity on the processed grid, obtaining a clustering center according to a density peak value, and finally outputting a clustering result. On the basis of a K-means algorithm, the influence of noise is eliminated, and the selection of an initial point is optimized; the problem of large calculation amount of a fast clustering algorithm based on density peak values is solved through gridding optimization, and excessive manual decision and errors caused by the manual decision are avoided. By introducing the concept of granularity, the edge of a dense area is prevented from being damaged during gridding, and the accuracy of the cluster initialization center point is improved.

Description

Clustering method and clustering system based on grid granularity calculation

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a clustering method and a clustering system based on grid granularity calculation.

Background

Currently, the closest prior art is as follows. With the development of big data technology and the rapid growth in the volume of generated data, mining big data has become a common problem, and traditional data storage and processing methods can no longer meet the requirements. Cluster analysis, as an important technique for analyzing various kinds of data, has again become a research hotspot. Conventional clustering algorithms include partition-based algorithms, hierarchy-based algorithms, density-based algorithms, and the like.

The objective of cluster analysis is to find structures hidden in data and, according to some similarity measure, to group data having similar properties into the same class as far as possible.

The K-means algorithm is one of the ten classic algorithms in the field of machine learning. It is a hard clustering algorithm and a typical prototype-based objective-function clustering method: a distance from the data points to the prototypes is used as the objective function to be optimized, and the update rule of the iterative operation is obtained by solving for the extremum of that function. The K-means algorithm takes the Euclidean distance as the similarity measure and seeks the optimal partition corresponding to an initial cluster center vector V, so that the evaluation index J is minimized. The algorithm uses the sum of squared errors criterion function as its clustering criterion function.
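For illustration only, the sum of squared errors criterion described above may be sketched as follows. The function names `assign` and `sse` and the sample data are hypothetical and are not part of the disclosure; the sketch assumes that the evaluation index J is the sum of squared Euclidean distances from each data point to its assigned cluster center.

```python
import math

def assign(points, centers):
    """Assign each point to the index of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
        for point in points
    ]

def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )

# Hypothetical sample data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
labels = assign(points, centers)
J = sse(points, centers, labels)
```

A lower value of J indicates that the points lie closer to their assigned centers, which is the quantity the iterative update seeks to reduce.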

The clustering by fast search and find of density peaks (CFSFDP) algorithm is a density-based clustering algorithm that uses high-density regions as the basis for its decisions. The CFSFDP algorithm first calculates the local density of each point by using a truncation distance (cutoff distance), and then calculates, for each data point, the minimum distance to the data points whose local density is higher than its own. A decision graph is then drawn from the calculated local density and minimum distance of each point, the cluster centers are selected manually in the decision graph, and each remaining non-center data point is assigned to the cluster of the cluster center closest to it. Finally, each obtained cluster is divided into a cluster core and a cluster halo to obtain the final clustering result. Compared with conventional approaches, this nonparametric approach is suitable for processing data sets of any shape and does not require the number of clusters to be set in advance.
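For illustration only, the two quantities plotted in the decision graph may be sketched as follows. The function name `density_peaks`, the parameter name `d_c` and the sample data are hypothetical. The sketch assumes the simple counting form of local density (the number of other points closer than the truncation distance) and the usual convention that a point with no denser neighbor receives its largest distance to any other point.

```python
import math

def density_peaks(points, d_c):
    """Return (rho, delta): local density and distance to the nearest denser point."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Local density: number of other points closer than the truncation distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Assumed convention: a point with no denser neighbor gets its
        # largest distance to any other point.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical sample data: a small group near the origin and a distant pair.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
rho, delta = density_peaks(points, d_c=1.2)
```

Points for which both `rho` and `delta` are large are the candidates that a user would pick as cluster centers in the decision graph.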

CLIQUE (CLustering In QUEst) is a simple grid-based clustering method for finding density-based clusters in subspaces. CLIQUE divides each dimension into non-overlapping intervals, thereby dividing the entire embedding space of the data objects into cells. Each attribute is divided into N equal parts, so that the whole data space is divided into a set of hyper-rectangular units; the data points in each unit are counted, units whose count exceeds a threshold S are called dense units, and connected dense units are then joined to form a class. Unlike other methods, it can automatically identify classes embedded in subspaces of the data.
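For illustration only, the dense-unit counting step described above may be sketched as follows. The sketch covers only the counting of points per unit in the full data space; it does not implement the subspace search or the joining of connected dense units. The function name `dense_units`, its parameters and the sample data are hypothetical.

```python
from collections import Counter

def dense_units(points, lower, upper, n_parts, threshold):
    """Count points per unit and keep the units holding more than `threshold` points."""
    counts = Counter()
    for point in points:
        # Index of the unit along each dimension; the upper bound is clamped
        # into the last interval.
        cell = tuple(
            min(int((x - lo) / (hi - lo) * n_parts), n_parts - 1)
            for x, lo, hi in zip(point, lower, upper)
        )
        counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c > threshold}

# Hypothetical sample data in the unit square, two intervals per dimension.
points = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (0.9, 0.9)]
dense = dense_units(points, lower=(0.0, 0.0), upper=(1.0, 1.0), n_parts=2, threshold=2)
```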

Granularity is a database term; in the field of computers, granularity refers to the minimum increment by which system memory can be expanded. Granularity is one of the most important aspects of designing a data warehouse, where it refers to the level of refinement or aggregation of the data held in the data units of the warehouse. The higher the degree of refinement, the lower the granularity level; conversely, the lower the degree of refinement, the higher the granularity level. The main problem with granularity is to set it at a suitable level, neither too high nor too low. A low granularity level provides exhaustive detail but takes up more storage space and requires longer query times. A high granularity level allows convenient, fast queries but cannot provide fine-grained data.

In summary, as a classical algorithm for solving the clustering problem, the K-means algorithm is simple and fast. When the clusters are dense and the differences between clusters are obvious, the clustering result is good, and when a large amount of data is processed, the algorithm is scalable and efficient. However, the conventional K-means algorithm still has several defects and needs further optimization: (1) Randomly selecting the initial cluster centers makes the algorithm unstable and prone to falling into a local optimum. (2) The K-means algorithm is sensitive to noise and isolated points: the centroid of a cluster is taken as the cluster center and carried into the next round of calculation, so a small amount of such data can greatly affect the mean, making the results unstable or even wrong. (3) Clusters of arbitrary shape cannot be found; generally only spherical clusters can be found. Because the K-means algorithm measures the similarity between data objects mainly by the Euclidean distance function and uses the sum of squared errors as its criterion function, it can generally find only spherical clusters whose data objects are fairly uniformly distributed.

The CFSFDP algorithm is simple in principle and easy to implement, its clustering effect is excellent, and it has attracted considerable attention. However, it has some limitations. The truncation distance is selected by the user according to experience, and an inappropriate choice leads to a poor clustering result. In addition, only one constant truncation distance parameter is used when measuring the density of the data points, so a good clustering effect cannot be obtained when several high-density points exist in the same cluster at the same time.

CLIQUE shares the high efficiency of grid-based algorithms, is insensitive to the order of data input, and does not need to assume any canonical data distribution. It scales linearly with the size of the input data, scales well as the data dimensionality increases, and is very effective for clustering high-dimensional data in large databases. It also has several limitations: (1) Like most density-based clustering algorithms, grid-based clustering relies heavily on the choice of the density threshold (if it is too high, clusters may be lost; if it is too low, clusters that should be separated may be merged). (2) If clusters and noise of different densities are present, it may not be possible to find a threshold value that suits all parts of the data space. (3) Many steps of the CLIQUE algorithm use approximations, so the accuracy of the clustering results may be reduced accordingly.

The difficulty of solving the above technical problems is as follows: the difficulty and the distinguishing feature of the invention lie mainly in how to integrate the idea of granularity into the K-means algorithm, the grid-based clustering algorithm and the density-based clustering algorithm, and in realizing the clustering method based on grid granularity calculation from this integrated perspective.

The significance of solving the above technical problems is as follows: the process of dividing a collection of physical or abstract objects into classes composed of similar objects is called clustering. A cluster generated by clustering is a collection of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters. Classification problems are numerous in the natural and social sciences, and cluster analysis is a statistical analysis method for studying the classification of samples or indices. By solving the above technical problems, the quality of the final clustering can be improved.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a clustering method and a clustering system based on grid granularity calculation. The clustering method introduces the idea of granularity. Building on the K-means clustering algorithm, it addresses that algorithm's strong dependence on the choice of initial center points and its sensitivity to noise and isolated points, and it avoids the manual selection of cluster centers required by the CFSFDP algorithm, thereby improving the performance of the K-means clustering algorithm.

The invention is realized as follows. A clustering method based on grid granularity calculation comprises: gridding the data set and dividing the grids into central grids, edge grids and noise grids by calculating the density of each grid and of its adjacent grids. Density calculation based on granularity is introduced to avoid damaging the edges of dense areas, which ultimately improves the accuracy of the initial cluster center points.

The invention eliminates the influence of noise and optimizes the selection of the initial point on the basis of the K-means algorithm. The problem of large calculation amount of the CFSFDP algorithm is solved through gridding optimization, and excessive manual decision and errors caused by the manual decision are avoided. By introducing the concept of granularity, the edge of a dense area is prevented from being damaged during gridding, and the accuracy of the cluster initialization center point is improved.

Further, the clustering method based on grid granularity calculation comprises the following steps:

step one, reading and initializing a data set;

step two, gridding the data set;

step three, calculating the density of each grid, classifying the grids, dividing the grids into mutually disjoint grids, traversing all the grids, marking the grids as a central grid, an edge grid and a noise grid, and removing the noise grid;

step four, performing granularity calculation on the processed grids;

step five, obtaining a clustering center according to the density peak value;

step six, outputting a clustering result.

Further, in step one, initializing data includes: the data set is read and initialized so that all data is projected into the data space.

Further, in step two, gridding the data set includes: uniformly dividing each dimension of the data space into the same number of segments, recorded as f; forming a plurality of grid objects of the same size, eliminating the grid objects that contain 0 data objects, and recording the remaining grid objects as a grid object set G; the number of grid objects in the grid object set is recorded as N;

the specific steps of gridding are as follows:

1) uniformly dividing each dimension of the data set space containing n data objects into the same number of segments, recorded as f, with an initial value of f = 2, to form a grid;

2) removing the grids that contain 0 data objects, and recording the number of remaining non-empty grid objects as N;

3) if N < n/6, setting f = f + 1 and returning to step 1); otherwise setting f = f/2 and returning to step 1);

4) determining the grid object set G, the number of segments f and the number of grid objects N.
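For illustration only, the gridding steps above may be sketched as follows. The function names `grid_cells` and `choose_grid`, the parameter `max_f` and the sample data are hypothetical. The sketch makes two assumptions that are not stated in the text: the search stops as soon as the number of non-empty grids reaches one sixth of the number of data objects (the "otherwise" branch of step 3) is not reproduced), and an upper bound `max_f` is imposed on the number of segments so that the loop always terminates.

```python
from collections import defaultdict

def grid_cells(points, f):
    """Map each non-empty grid index to its points, using f segments per dimension."""
    dims = range(len(points[0]))
    lower = [min(p[d] for p in points) for d in dims]
    upper = [max(p[d] for p in points) for d in dims]
    cells = defaultdict(list)
    for p in points:
        idx = tuple(
            # A zero-width dimension is given span 1.0 to avoid division by zero.
            min(int((p[d] - lower[d]) / ((upper[d] - lower[d]) or 1.0) * f), f - 1)
            for d in dims
        )
        cells[idx].append(p)
    return dict(cells)  # empty grids never appear, so they are removed implicitly

def choose_grid(points, max_f=64):
    """Increase f from 2 until the non-empty grid count N reaches n/6."""
    n = len(points)
    f = 2
    cells = grid_cells(points, f)
    # Assumption: stop once N >= n/6; max_f is a hypothetical safety bound.
    while len(cells) < n / 6 and f < max_f:
        f += 1
        cells = grid_cells(points, f)
    return f, cells

# Hypothetical sample data: 60 points on a diagonal line.
points = [(float(i), float(i)) for i in range(60)]
f, cells = choose_grid(points)
```

In this sketch the grid object set G corresponds to the keys of `cells`, and N corresponds to `len(cells)`.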

Further, in step three, calculating the density of the grid and classifying includes: the number of data points contained in each grid is calculated and the grids are classified into the following three categories:

(1) if all the adjacent grids of a grid contain data, the grid is marked as a central grid;

(2) if the adjacent grids of a grid include both a central grid and an empty grid, the grid is marked as an edge grid;

(3) if all the adjacent grids of a grid are empty, the grid is marked as a noise grid, and the data of the noise grid is removed.
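For illustration only, the three-way classification above may be sketched as follows. The function names `neighbours` and `classify` and the sample data are hypothetical. The sketch makes two assumptions that are not stated in the text: neighbors that fall outside the gridded space are ignored rather than treated as empty, and every non-empty grid that is neither central nor noise is treated as an edge grid.

```python
from itertools import product

def neighbours(cell, f):
    """Yield the in-range grids adjacent to `cell` (at most 3^d - 1 of them)."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):
            nb = tuple(c + o for c, o in zip(cell, offset))
            # Assumption: positions outside the gridded space are ignored.
            if all(0 <= x < f for x in nb):
                yield nb

def classify(counts, f):
    """Label every non-empty grid as 'central', 'edge' or 'noise'."""
    labels = {}
    for cell in counts:
        occupied = [nb in counts for nb in neighbours(cell, f)]
        if occupied and all(occupied):
            labels[cell] = "central"
        elif not any(occupied):
            labels[cell] = "noise"
        else:
            labels[cell] = "edge"  # assumption: all remaining cases are edge grids
    return labels

# Hypothetical sample: a 3 x 3 block of non-empty grids plus one isolated grid,
# with f = 5 segments per dimension. Values are point counts per grid.
counts = {(i, j): 4 for i in range(3) for j in range(3)}
counts[(4, 4)] = 1
labels = classify(counts, f=5)
kept = {cell: c for cell, c in counts.items() if labels[cell] != "noise"}
```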

Further, in step four, the density calculation based on the granularity includes: taking a central grid as the center, the central grid and all its adjacent grids, 3^d grids in total, are defined as the minimum computational granularity (where d is the dimension);

putting all the central grids into a calculation queue; calculating the density of the set of grids at the granularity according to the divided grids; sequentially carrying out density calculation based on granularity on the central grids according to the queues;

Further, in step five, obtaining the clustering center comprises: clustering the grid objects and the data points in each grid object starting from the initial points obtained in step four.

Further, the granularity-based density calculation method includes:

step 1: setting the midpoint of the area with the highest density as an initial point; if the number of current initial points reaches k, stopping;

the method for acquiring the initial points comprises: sorting the minimum-granularity densities, wherein the grid object at the position of a cluster center has a higher density ρ; setting the midpoint of the central grid of the area with the highest density as an initial point; marking the grid set as the initial point and its adjacent grids as processed grids and setting the grid density of that area to 0; if an adjacent grid of the central grid is also a central grid, processing it in the same way; recalculating and re-sorting the granularity densities; and again setting the midpoint of the central grid of the area with the highest density as an initial point, until the number of initial points reaches k;

step 2: otherwise, marking the grid set as the initial point and its adjacent grids as processed grids, setting the grid density of that area to 0, recalculating the granularity, and returning to step 1.
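For illustration only, the selection of initial points by granularity-based density may be sketched as follows. The function names `block` and `initial_points`, the parameters `lower` and `width` (the origin and side lengths of the grids) and the sample data are hypothetical. The sketch assumes that the granularity density of a central grid is the number of data points in that grid and all of its adjacent grids, and that ties are broken by grid index order.

```python
from itertools import product

def block(cell):
    """Return the grid together with all adjacent grids: 3^d index tuples."""
    return [tuple(c + o for c, o in zip(cell, off))
            for off in product((-1, 0, 1), repeat=len(cell))]

def initial_points(counts, central, k, lower, width):
    """Pick up to k initial points as midpoints of the densest central grids."""
    counts = dict(counts)   # work on a copy, since densities are set to 0 below
    queue = set(central)    # central grids that have not been processed yet
    seeds = []
    while queue and len(seeds) < k:
        # Granularity density: points in the grid plus all of its neighbors.
        density = {c: sum(counts.get(nb, 0) for nb in block(c)) for c in queue}
        best = max(sorted(queue), key=density.get)
        if density[best] == 0:
            break
        seeds.append(tuple(lo + (i + 0.5) * w for i, lo, w in zip(best, lower, width)))
        for nb in block(best):  # mark the whole area as processed
            counts[nb] = 0
            queue.discard(nb)
    return seeds

# Hypothetical sample: two dense areas centered on grids (1, 1) and (5, 5).
counts = {(0, 0): 1, (0, 1): 2, (0, 2): 1, (1, 0): 2, (1, 1): 5, (1, 2): 2,
          (2, 0): 1, (2, 1): 2, (2, 2): 1,
          (4, 4): 1, (4, 5): 1, (4, 6): 1, (5, 4): 1, (5, 5): 3, (5, 6): 1,
          (6, 4): 1, (6, 5): 1, (6, 6): 1}
seeds = initial_points(counts, central={(1, 1), (5, 5)}, k=2,
                       lower=(0.0, 0.0), width=(1.0, 1.0))
```

Setting the density of a processed area to 0 prevents a second initial point from being placed in the same dense region.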

Another object of the present invention is to provide a clustering control system for implementing the grid granularity calculation-based clustering method.

It is a further object of the present invention to provide a computer program product stored on a computer readable medium, comprising a computer readable program for providing a user input interface for implementing said grid granularity calculation based clustering method when executed on an electronic device.

It is another object of the present invention to provide a computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the above-mentioned clustering method based on grid-granularity computation.

Another object of the present invention is to provide a calculator for implementing the clustering method based on grid granularity calculation.

In summary, the advantages and positive effects of the invention are: the invention provides a clustering method based on grid granularity calculation, which reads an original data set; initializing relevant parameters; dividing n-dimensional data into mutually disjoint grids, traversing all the grids and marking the grids as a central grid, an edge grid and a noise grid; and then carrying out granularity calculation on the processed grid, obtaining a clustering center according to the density peak value, and finally outputting a clustering result. The method has the advantages of high accuracy, small difference of clustering effects of different data sets and small parameter dependence.

Compared with the prior art, the invention has the following advantages: the invention grids the data set and divides the grids into central grids, edge grids and noise grids by calculating the density of each grid and of its adjacent grids. Density calculation based on granularity is introduced to avoid damaging the edges of dense areas, which ultimately improves the accuracy of the initial cluster center points.

Drawings

Fig. 1 is a flowchart of a clustering method based on grid granularity calculation according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a clustering method based on grid granularity calculation according to an embodiment of the present invention.

Fig. 3 is a graph of the minimum computation granularity of 3 × 3 meshes in a divided two-dimensional mesh provided in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The K-means algorithm is one of ten classic algorithms in the field of machine learning, and is widely applied to science and industry due to simplicity and high efficiency. However, the conventional K-means clustering method has two obvious disadvantages:

The algorithm initially selects cluster centers at random. For clustering algorithms, the initial cluster center is important because it is the basis for computing the result, and each new center is updated from the previous one. If the initial centers are generated randomly, convergence to the correct result is difficult and time-consuming.

The algorithm is sensitive to noise or isolated points, and since the center point is calculated by mean, once a noise point is classified into a designated cluster, the center point is necessarily deviated from the actual position.

Aiming at the problems in the prior art, the invention provides a clustering method and a clustering system based on grid granularity calculation, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the clustering method based on grid granularity calculation provided by the embodiment of the present invention includes:

S101: the data set is read and initialized.

S102: the data set is gridded.

S103: the density of each grid is calculated, the grids are classified, and the noise grids are removed.

S104: the granularity-based density calculation is performed for the regions containing a central grid.

S105: the points obtained in the previous step are taken as initial points, and K-means clustering is carried out.

S106: the result is output.

Fig. 2 shows the principle of the clustering method based on grid granularity calculation according to an embodiment of the present invention.

In step S101, initializing data includes: the data set is read and initialized so that all data is projected into the data space.

In step S102, gridding the data set includes: each dimension of the data space is divided uniformly into the same number of segments, recorded as f, forming a plurality of grid objects of the same size; the grid objects that contain 0 data objects are eliminated, and the remaining grid objects are recorded as a grid object set G; the number of grid objects in the grid object set is recorded as N. Experimental results show that the algorithm achieves better clustering quality when the number N of grid objects in the grid object set is greater than or equal to 1/6 of the number n of data objects in the data set.

As a preferred embodiment of the present invention, the specific steps of gridding are as follows:

Step 1: uniformly divide each dimension of the data set space containing n data objects into the same number of segments, recorded as f (with an initial value of f = 2), to form a grid.

Step 2: remove the grids that contain 0 data objects, and record the number of remaining non-empty grid objects as N.

Step 3: if N < n/6, set f = f + 1 and return to Step 1; otherwise set f = f/2 and return to Step 1.

Step 4: determine the grid object set G, the number of segments f and the number of grid objects N.

In step S103, the calculating the density of the mesh and classifying includes: the number of data points contained in each grid is calculated and the grids are classified into the following 3 categories:

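(1) If all the adjacent grids of a grid contain data, the grid is marked as a central grid.

(2) If the adjacent grids of a grid include both a central grid and an empty grid, the grid is marked as an edge grid.

(3) If all the adjacent grids of a grid are empty, the grid is marked as a noise grid, and the data of the noise grid is removed.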

As a preferred embodiment of the present invention, the side length can also be directly made as follows when dividing the grid:

In step S104, the density calculation based on the granularity includes: taking a central grid as the center, the central grid and all its adjacent grids, 3^d grids in total, are defined as the minimum computational granularity (where d is the dimension).

If the minimum granularity were chosen as 2^d, the position of the central grid within the set of grids would have to be considered, so this granularity is not adopted in order to avoid errors caused by position. If a larger minimum granularity were selected, the probability of multiple density peaks occurring within one granularity under the current grid division method would become higher. Therefore, 3^d is selected as the minimum computational granularity.
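For illustration only, the positional argument above may be checked with a short sketch. The function names `granule_offsets` and `is_centred` are hypothetical. The sketch enumerates the index offsets of a block of grids with a given side length and tests whether the block is symmetric about the central grid.

```python
from itertools import product

def granule_offsets(d, side=3):
    """Index offsets of a side^d block of grids, starting at -(side // 2)."""
    start = -(side // 2)
    return list(product(range(start, start + side), repeat=d))

def is_centred(offsets):
    """True when the block is symmetric about the offset (0, ..., 0)."""
    return all(tuple(-o for o in off) in offsets for off in offsets)

block_3 = granule_offsets(2, side=3)  # 3^2 = 9 grids, central grid in the middle
block_2 = granule_offsets(2, side=2)  # 2^2 = 4 grids, central grid in a corner
```

With a side length of 3 the central grid sits in the middle of the block in every dimension, whereas with a side length of 2 it necessarily sits in a corner, so the result would depend on which corner is chosen.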

All the central grids are put into a calculation queue. The density of the set of meshes at the granularity is calculated from the divided meshes. And sequentially carrying out density calculation based on granularity on the central grids according to the queues. As shown in fig. 3, in a divided two-dimensional grid, 3 × 3 grids are selected as the minimum computation granularity.

The following procedure is performed for each granularity:

Step 4.1: the midpoint of the region of highest density is set as an initial point.

Step 4.1.1: if the number of current initial points reaches k, the current step is stopped.

Step 4.2: otherwise, the grid set as the initial point and its adjacent grids are marked as processed grids, the grid density of that area is set to 0, the granularity is recalculated, and the procedure returns to Step 4.1.

In step S104, acquiring the initial points includes: sorting the minimum-granularity densities, wherein the grid object at the position of a cluster center has a higher density ρ; setting the midpoint of the central grid of the area with the highest density as an initial point; marking the grid set as the initial point and its adjacent grids as processed grids and setting the grid density of that area to 0; if an adjacent grid of the central grid is also a central grid, processing that central grid in the same way; recalculating and re-sorting the granularity densities; and again setting the midpoint of the central grid of the area with the highest density as an initial point, until the number of initial points reaches k.

In step S105, the cluster calculation includes: and clustering the grid object and the data points in each grid object through the initial points obtained in the steps, and outputting a clustering result.
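For illustration only, K-means clustering that starts from supplied initial points rather than random centers may be sketched as follows. The function name `kmeans`, the parameter `max_iter` and the sample data are hypothetical; the sketch assumes standard Lloyd iterations with Euclidean distance and keeps the previous center if a cluster becomes empty.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Run Lloyd iterations starting from the supplied initial points."""
    centers = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged, so the run has converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels

# Hypothetical sample data and initial points (for example, grid midpoints).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = kmeans(points, seeds=[(1.5, 1.5), (8.5, 8.5)])
```

Because the initial points already lie inside the dense areas, the assignments in this example settle after the first update of the centers.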

The invention is further described below in connection with a comparative table with the prior art.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in whole or in part as a computer program product, the product includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, wireless, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A clustering method based on grid granularity calculation is characterized by comprising the following steps:

step one, reading and initializing a data set;

step two, gridding the data set;

step four, performing density calculation based on granularity on the processed grids;

step five, obtaining a clustering center according to the density peak value;

step six, outputting a clustering result.

2. The method for clustering based on grid granularity calculation of claim 1, wherein in the first step, initializing data comprises: the data set is read and initialized so that all data is projected into the data space.

3. The method for clustering based on grid-granularity computation of claim 1, wherein in the second step, gridding the data set comprises: uniformly dividing each dimension of the data space into the same segment number, and recording the segment number as f; forming a plurality of grid objects with the same size, eliminating the grid objects with the data object number of 0 in the grid objects, and marking the rest grid objects as a grid object set as G; the number of grid objects in the grid object set is recorded as N;

the specific steps of gridding are as follows:

4. The method for clustering based on grid granularity calculation as claimed in claim 1, wherein in step three, calculating the density of the grid and classifying comprises: the number of data points contained in each grid is calculated and the grids are classified into the following three categories:

5. The method for clustering based on grid granularity calculation as claimed in claim 1, wherein in step four, the granularity-based density calculation comprises: taking a central grid as the center, the central grid and all its adjacent grids, 3^d grids in total, are defined as the minimum computational granularity, where d is the dimension;

6. The method for clustering based on grid granularity calculation of claim 5, wherein the method for calculating the granularity-based density comprises:

step 1: setting the midpoint of the area with the highest density as an initial point; if the number of current initial points reaches k, stopping; the method for acquiring the initial points comprises: sorting the minimum-granularity densities, wherein the grid object at the position of a cluster center has a higher density ρ; setting the midpoint of the central grid of the area with the highest density as an initial point; marking the grid set as the initial point and its adjacent grids as processed grids and setting the grid density of that area to 0; if an adjacent grid of the central grid is also a central grid, recalculating and re-sorting the granularity densities; and again setting the midpoint of the central grid of the area with the highest density as an initial point, until the number of initial points reaches k;

step 2: otherwise, marking the grid set as the initial point and its adjacent grids as processed grids, setting the grid density of that area to 0, recalculating the granularity, and returning to step 1.

7. A clustering control system for implementing the grid granularity calculation-based clustering method according to any one of claims 1 to 6.

8. A computer program product stored on a computer readable medium, comprising a computer readable program for providing a user input interface for implementing a grid-granularity computation-based clustering method according to any one of claims 1 to 6 when executed on an electronic device.

9. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the grid granularity calculation-based clustering method of any one of claims 1 to 6.

10. A calculator for implementing the grid granularity calculation-based clustering method according to any one of claims 1 to 6.