CN109978066B - Rapid spectral clustering method based on multi-scale data structure - Google Patents

Rapid spectral clustering method based on multi-scale data structure

Info

Publication number
CN109978066B
CN109978066B
Authority
CN
China
Prior art keywords: data, node, matrix, sets, layer
Prior art date
Legal status: Active
Application number
CN201910257841.3A
Other languages
Chinese (zh)
Other versions
CN109978066A (en)
Inventor
陈旻昕
张重阳
朱国丰
吴晨健
陈虹
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201910257841.3A priority Critical patent/CN109978066B/en
Publication of CN109978066A publication Critical patent/CN109978066A/en
Application granted granted Critical
Publication of CN109978066B publication Critical patent/CN109978066B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Abstract

The invention discloses a rapid spectral clustering method based on a multi-scale data structure, comprising the following steps. Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um} (where n is the number of data points, d is the dimensionality of the data, and m is the number of data sets) and a tree structure. Step 2: calculate a similarity matrix W between the sets using a Gaussian kernel; in a specific implementation, some sampling points are selected from the sets, and the degree of similarity of two sets is measured by the Euclidean distances between the sampling points. The invention has the beneficial effect that the method innovatively adopts a K-d tree algorithm to obtain a series of data sets and replaces the original similarity matrix constructed from individual data points with a similarity matrix computed between the sets.

Description

Rapid spectral clustering method based on multi-scale data structure
Technical Field
The invention relates to the field of clustering, in particular to a rapid spectral clustering method based on a multi-scale data structure.
Background
From the perspective of machine vision and machine learning, clustering is an unsupervised learning process that classifies data according to their similarity, so that data within the same category are as similar as possible to each other, while data in different categories are as dissimilar as possible. Data clustering is widely applied in medical image segmentation, financial data classification, and other areas. In recent years, with the development of artificial intelligence, machine learning, and computer vision, research on data clustering methods has become particularly important. Current data clustering methods generally fall into the following categories: 1. clustering methods based on partitioning; 2. density-based clustering methods; 3. clustering methods based on graph theory.
The K-means algorithm [1] is representative of partition-based clustering algorithms and, owing to its simple implementation and high efficiency, is one of the most common data clustering methods at present. The algorithm randomly selects k data points as the initial cluster centers, computes the distance between each data point and the cluster centers, and assigns each data point to the nearest center. After all data have been assigned, the center of each cluster is recomputed. If no data point is reassigned to a different center, i.e., the cluster centers no longer change, the algorithm terminates; otherwise the process repeats until this termination condition is met.
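For illustration, a minimal NumPy sketch of this procedure (the random initialization and the absence of empty-cluster handling are simplifications of ours):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=None):
    """Minimal k-means following the procedure described above."""
    rng = rng or np.random.default_rng(0)
    # randomly select k data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every data point to its nearest cluster center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # after all data are assigned, recompute each cluster center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers unchanged: terminate
            break
        centers = new_centers
    return labels
```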
The DBSCAN algorithm [2], as a representative density-based clustering algorithm, can divide sufficiently high-density regions into clusters while effectively avoiding noise interference. The method has two parameters, a search radius r and a minimum point count minPoints. Starting from a randomly selected unvisited data point taken as the initial point, all nearby data points within the search radius r are found. If the number of such points is greater than or equal to minPoints, the current point forms a cluster with its nearby data, and the initial point is marked as visited. All points reachable in this way are then processed recursively by the same method, expanding the cluster. If the number of nearby points is less than minPoints, the point is tentatively marked as noise. Once all points in the cluster are marked as visited, the remaining unvisited data are processed in the same manner, until all data are marked as visited.
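A minimal sketch of this procedure, assuming a scipy k-d tree for the radius queries (the function name and the use of -1 as the noise label are our conventions):

```python
import numpy as np
from scipy.spatial import cKDTree

def dbscan(X, r, min_points):
    """Minimal DBSCAN following the description above; label -1 marks noise."""
    tree = cKDTree(X)
    labels = np.full(len(X), -1)            # tentatively mark everything as noise
    visited = np.zeros(len(X), dtype=bool)
    cluster = -1
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = tree.query_ball_point(X[i], r)   # points within radius r
        if len(neighbors) < min_points:
            continue                         # i stays tentatively noise
        cluster += 1                         # i is a core point: new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                         # expand the cluster
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster          # border point or reclaimed noise
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = tree.query_ball_point(X[j], r)
            if len(j_neighbors) >= min_points:
                seeds.extend(j_neighbors)    # j is also a core point: keep growing
    return labels
```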
The Spectral Clustering algorithm [3] is developed from graph theory; it has stronger adaptability to data distribution and a better clustering effect. The algorithm first constructs a similarity matrix W and a degree matrix D from the input data, builds the Laplacian matrix L = D^(-1/2) · W · D^(-1/2), and computes the eigenvectors f corresponding to the first k eigenvalues of L. The matrix formed by these eigenvectors is normalized by rows, each row of the normalized matrix is taken as a sample, and the samples are clustered with k-means to obtain the final clustering result.
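The classical pipeline just described can be sketched as follows; the n × n similarity matrix and the full eigendecomposition visible here are the two bottlenecks the invention targets (the fixed sigma and the use of scikit-learn's KMeans are our choices):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Classical spectral clustering: O(n^2) memory, O(n^3) eigendecomposition."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))  # similarity matrix
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L = D_inv_sqrt @ W @ D_inv_sqrt          # L = D^(-1/2) W D^(-1/2)
    vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    F = vecs[:, -k:]                         # the k leading eigenvectors of L
    F /= np.linalg.norm(F, axis=1, keepdims=True)  # normalize by rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)  # cluster the rows
```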
The traditional technology has the following technical problems:
Although the K-means algorithm processes data efficiently, it cannot effectively handle non-convex data sets; it can therefore often serve only as a small component of a data processing pipeline and cannot independently complete some data classification tasks.
The DBSCAN algorithm does not handle high-dimensional data well; moreover, if the density distribution of the data is uneven and the distances between clusters differ greatly, the clustering result is poor.
The Spectral Clustering algorithm is one of the most effective clustering algorithms at present; it handles various types of data well and, by the nature of the algorithm, has advantages when processing high-dimensional data. However, it must construct a similarity matrix and solve for that matrix's eigenvectors, and both steps carry a large computational cost, which is unacceptable for today's large-scale data and for images of larger size.
Disclosure of Invention
The technical problem to be solved by the invention is the following: spectral clustering performs excellently in data processing, but its computational cost is a major obstacle. The invention therefore provides a rapid spectral clustering method based on a multi-scale data structure. The method optimizes and improves the two steps of similarity-matrix construction and eigenvector decomposition in the spectral clustering algorithm, effectively improves the running efficiency of the algorithm, and enables spectral clustering to be applied successfully to larger-scale data.
In order to solve the technical problem, the invention provides a fast spectral clustering method based on a multi-scale data structure, which comprises the following steps:
Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um} (where n is the number of data points, d is the dimensionality of the data, and m is the number of data sets), a conversion matrix H, and a tree structure;
Step 2: calculate a similarity matrix W between the sets, wherein the kernel function used to calculate W is a Gaussian kernel, W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2));
Step 3: select the layer at depth l_init as the initial layer of the K-d tree, traverse each node of that layer in turn, operate on each node, and finally obtain the eigenvector matrix Evector of the root node;
Step 4: convert the eigenvectors back to the original data space by letting Y = H × Evector;
Step 5: normalize Y by rows;
Step 6: taking each row of Y as a data point, cluster the eigenvectors with the fuzzy C-means algorithm into k classes C1, C2, ..., Ck;
Step 7: if the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class.
In one embodiment, the step 1 comprises the following steps:
Step 1.1: construct the root node S0; the data in S0 is the whole data set V; compute the variance on each dimension of the data set, find the dimension corresponding to the maximum variance maxV, and denote it maxDim;
Step 1.2: taking the mean value along coordinate axis maxDim as the split point, split the original data set V into two subsets V1 and V2; the split is realized by a hyperplane that passes through the split point and is perpendicular to coordinate axis maxDim; generate left and right child nodes S_1^1 and S_2^1 of depth 1 from the root node; the left node corresponds to the data whose coordinate on maxDim is smaller than the split point, and the right node to the data whose coordinate on maxDim is larger than the split point;
Step 1.3: repeat the above process for each node, stopping when the maxV computed at a node is smaller than a certain threshold or the node contains only one datum; all leaf nodes found in this way form the aforementioned data sets; each node is denoted S_i^l, where l represents the depth of the tree and i represents the node index at the current depth;
Step 1.4: from the resulting data set U = {u1, u2, ..., um}, obtain the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, based on the following equation, in which |u_j| denotes the number of data in set u_j:
h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j, and h_ij = 0 otherwise.
in one embodiment, the step 3 comprises the following steps:
step 3.1, calculating the similarity matrix of each node
Figure BDA0002014323270000045
Figure BDA0002014323270000046
Is a main sub-type of order s of W, i.e.
Figure BDA0002014323270000047
The specific method is to directly select corresponding rows and columns from W, wherein m1Is a node Sl-1The number of middle sets;
step 3.2 is based on the formula
Figure BDA0002014323270000048
Obtaining a similarity matrix with lower dimensionality
Figure BDA0002014323270000049
For the initial layer, this QlIs of size m1×m1The identity matrix of (1);
step 3.3 calculation
Figure BDA00020143232700000410
Of the "row" and "degree of acquisition matrix
Figure BDA00020143232700000411
Step 3.4 calculating Laplacian matrix
Figure BDA00020143232700000412
Step 3.5, calculating the first k eigenvectors of the Laplacian matrix to obtain m1The feature vector e with the size of x k, and the obtained feature vector is processed by a formula
Figure BDA0002014323270000051
The feature vector is converted back to the original space, and the obtained result is added to the father node Sl-1Of the conversion matrix Ql-1The addition method is as follows:
Figure BDA0002014323270000052
and 3.6, after all the nodes in the layer are calculated, making l equal to l-1, switching to the previous layer of the tree, and repeating the step 3 until the depth of the layer is 0 to obtain a feature vector matrix eventor corresponding to the first k feature values of the 0 th layer, namely the original data layer.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of any of the methods.
A processor for running a program, wherein the program when running performs any of the methods.
The invention has the beneficial effects that:
The method innovatively adopts a K-d tree algorithm to obtain a series of data sets and replaces the original similarity matrix constructed from individual data points with a similarity matrix computed between the sets; since the number m of data sets is far smaller than the number n of data points, the dimension of the set-based similarity matrix is also far smaller than that of the data-based one.
In addition, the method innovatively exploits the tree structure of the K-d tree and approximates one high-dimensional eigendecomposition problem by several low-dimensional eigendecomposition problems, thereby directly addressing the large computational cost of the eigendecomposition in the spectral clustering algorithm.
Mathematically, analyzing from the angle of complexity: in the original spectral clustering algorithm, constructing the similarity matrix has complexity O(n^2) and the eigendecomposition has complexity O(n^3). In this method, the preprocessing has complexity O(nd·log n), constructing the similarity matrix has complexity O(m^2), and the eigendecomposition has complexity O(n), where m is the number of data sets, n is the number of data points, and d is the dimensionality of the data. Since m is far smaller than n, the comparison of complexities shows the method's advantage in efficiency.
Drawings
FIG. 1 is a schematic diagram of the tree structure obtained by the K-d tree in the fast spectral clustering method based on a multi-scale data structure of the present invention.
FIG. 2 is a schematic diagram of an artificial data set in the multi-scale data structure-based fast spectral clustering method of the present invention.
FIG. 3 is a cluster structure diagram of the fast spectral clustering method based on multi-scale data structure.
FIG. 4 is a schematic diagram of a selected image from the Weizmann data set in the multi-scale data structure-based fast spectral clustering method according to the present invention.
FIG. 5 shows the result of the fast spectral clustering method based on a multi-scale data structure of the present invention after clustering into five classes.
FIG. 6 is the boundary of an object extracted according to the gold standard in the multi-scale data structure-based fast spectral clustering method of the present invention.
Detailed Description
The present invention is further described below in conjunction with the drawings and specific examples, so that those skilled in the art can better understand and practice it; the examples, however, are not intended to limit the present invention.
Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um}, where n is the number of data points, d is the dimensionality of the data, and m is the number of data sets, together with a tree structure. The specific steps are as follows:
Step 1.1: construct the root node S0; the data in S0 is the whole data set V; compute the variance on each dimension of the data set, find the dimension corresponding to the maximum variance maxV, and denote it maxDim (the greater the variance, the lower the coupling between the data and the smaller the similarity between them).
Step 1.2: taking the mean value along coordinate axis maxDim as the split point, split the original data set V into two subsets V1 and V2 (the set may also be split into more subsets; two subsets are taken as an example here); the split is realized by a hyperplane that passes through the split point and is perpendicular to coordinate axis maxDim. Generate left and right child nodes S_1^1 and S_2^1 of depth 1 from the root node; the left node corresponds to the data whose coordinate on maxDim is smaller than the split point, and the right node to the data whose coordinate on maxDim is larger than the split point.
Step 1.3: repeat the above process for each node, stopping when the maxV computed at a node is smaller than a certain threshold or the node contains only one datum. All leaf nodes found in this way form the aforementioned data sets; the coupling within each data set is high. A tree structure as shown in FIG. 1 is thereby obtained. Each node is denoted S_i^l, where l represents the depth of the tree and i represents the node index at the current depth.
Step 1.4: from the resulting data set U = {u1, u2, ..., um}, obtain the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, based on the following equation, in which |u_j| denotes the number of data in set u_j:
h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j, and h_ij = 0 otherwise.
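Steps 1.1–1.4 can be sketched as follows; for brevity the sketch returns only the leaf data sets and the transform matrix H, not the full tree structure that step 3 later traverses, and the guard against degenerate splits is an implementation detail of ours:

```python
import numpy as np

def kdtree_preprocess(V, threshold):
    """Split V (n x d) recursively along the axis of maximum variance."""
    leaves = []

    def split(idx):
        var = V[idx].var(axis=0)            # variance on each dimension
        max_dim = int(np.argmax(var))       # dimension of maximum variance
        if var[max_dim] < threshold or len(idx) <= 1:
            leaves.append(idx)              # leaf node: one data set u_j
            return
        cut = V[idx, max_dim].mean()        # split point: mean along max_dim
        left = idx[V[idx, max_dim] <= cut]
        right = idx[V[idx, max_dim] > cut]
        if len(left) == 0 or len(right) == 0:   # degenerate split guard (ours)
            leaves.append(idx)
            return
        split(left)
        split(right)

    split(np.arange(len(V)))
    n, m = len(V), len(leaves)
    H = np.zeros((n, m))                    # h_ij = 1/sqrt(|u_j|) if v_i in u_j
    for j, u in enumerate(leaves):
        H[u, j] = 1.0 / np.sqrt(len(u))
    return leaves, H
```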
Step 2: and calculating a similarity matrix W between the sets. Wherein the kernel function used to calculate W is
a Gaussian kernel, W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2)). In a specific implementation, some sampling points are selected from the sets, and the degree of similarity of two sets is measured by the Euclidean distances between the sampling points.
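A sketch of this step under stated assumptions: the patent specifies only that Euclidean distances between sampled points measure the similarity of two sets, so the sample size and the use of the mean pairwise distance below are illustrative choices of ours, and sigma is assumed to be given (the worked example later chooses it adaptively following [4]):

```python
import numpy as np
from scipy.spatial.distance import cdist

def set_similarity(V, leaves, sigma, n_samples=3, rng=None):
    """Set-level similarity matrix W (m x m) from a few sampled points per set."""
    rng = rng or np.random.default_rng(0)
    samples = [V[rng.choice(u, size=min(n_samples, len(u)), replace=False)]
               for u in leaves]
    m = len(leaves)
    W = np.ones((m, m))                     # self-similarity is 1
    for i in range(m):
        for j in range(i + 1, m):
            d = cdist(samples[i], samples[j]).mean()   # set-to-set distance
            W[i, j] = W[j, i] = np.exp(-d**2 / (2 * sigma**2))
    return W
```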
Step 3: select the layer at depth l_init as the initial layer of the K-d tree, traverse each node of that layer in turn, and perform the following operations on each node:
Step 3.1: compute the similarity matrix W_S of the node;
W_S is a principal submatrix of W, obtained by directly selecting the corresponding rows and columns from W, where m1 is the number of sets contained in the node.
Step 3.2: based on the formula W~ = Q_l^T · W_S · Q_l, obtain a lower-dimensional similarity matrix W~; for the initial layer, Q_l is the identity matrix of size m1 × m1.
Step 3.3: sum the rows of W~ to obtain the degree matrix D~, with D~_ii = Σ_j W~_ij.
Step 3.4: compute the Laplacian matrix L~ = D~^(-1/2) · W~ · D~^(-1/2).
Step 3.5: compute the first k eigenvectors of the Laplacian matrix to obtain an eigenvector matrix E of size m1 × k; convert the eigenvectors back to the original set space via E~ = Q_l · E, and add the result as the corresponding block of the parent node's conversion matrix Q_{l-1}.
Step 3.6: after all the nodes in the layer have been computed, let l = l−1, move to the previous layer of the tree, and repeat step 3 until the depth of the layer is 0, obtaining the eigenvector matrix Evector corresponding to the first k eigenvalues of layer 0, i.e., the original data layer.
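The per-node computation of steps 3.1–3.5 can be sketched as follows; the function and variable names are ours, and the block-wise assembly of the parent's conversion matrix described after the code is our reading of the patent figures rather than a stated formula:

```python
import numpy as np

def process_node(W, set_indices, Q, k):
    """One per-node pass of steps 3.1-3.5 (a sketch; names are ours).

    W           -- full m x m set-level similarity matrix
    set_indices -- indices of the sets belonging to this node (m1 of them)
    Q           -- this node's conversion matrix (m1 x q); identity at the
                   initial layer
    k           -- number of leading eigenvectors to keep
    """
    W_S = W[np.ix_(set_indices, set_indices)]   # 3.1: principal submatrix
    W_t = Q.T @ W_S @ Q                         # 3.2: reduced similarity
    d = np.maximum(W_t.sum(axis=1), 1e-12)      # 3.3: row sums give degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ W_t @ D_inv_sqrt           # 3.4: normalized Laplacian
    vals, vecs = np.linalg.eigh(L)              # eigenvalues ascending
    E = vecs[:, -k:]                            # 3.5: k leading eigenvectors
    return Q @ E                                # converted back to set space
```

Under this reading, a parent whose children return blocks of size m_left × k and m_right × k assembles a conversion matrix Q_{l-1} of size m1 × 2k by placing each block on the rows of that child's sets and zeros elsewhere, so every eigendecomposition along the tree stays low-dimensional.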
Step 4: let Y = H × Evector to convert the eigenvectors back to the original data set space.
Step 5: normalize Y by rows.
Step 6: taking each row of Y as a data point, cluster the eigenvectors with the Fuzzy C-means algorithm into k classes C1, C2, ..., Ck.
Step 7: if the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class.
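Steps 4–7 can be sketched as follows, assuming H and Evector from the preceding stages; the fuzzy C-means loop is a standard minimal implementation in which the fuzzifier m = 2 and the tolerance are choices of ours, not values taken from the patent:

```python
import numpy as np
from scipy.spatial.distance import cdist

def fuzzy_cmeans(Y, k, m=2.0, iters=100, tol=1e-5, rng=None):
    """Minimal fuzzy C-means on the row-normalized embedding Y (steps 6-7)."""
    rng = rng or np.random.default_rng(0)
    U = rng.random((len(Y), k))
    U /= U.sum(axis=1, keepdims=True)             # random fuzzy memberships
    for _ in range(iters):
        Um = U ** m
        C = (Um.T @ Y) / Um.sum(axis=0)[:, None]  # fuzzy cluster centers
        D = np.maximum(cdist(Y, C), 1e-12)        # point-to-center distances
        U_new = D ** (-2.0 / (m - 1))
        U_new /= U_new.sum(axis=1, keepdims=True) # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U.argmax(axis=1)                       # hard label per row (step 7)

# Y = H @ Evector                                 # step 4: back to data space
# Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # step 5: row-normalize
# labels = fuzzy_cmeans(Y, k=3)                   # steps 6-7
```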
It can be seen from the foregoing technical solution that the method innovatively adopts a K-d tree algorithm to obtain a series of data sets and replaces the original similarity matrix constructed from individual data points with a similarity matrix computed between the sets; since the number m of data sets is far smaller than the number n of data points, the dimension of the set-based similarity matrix is also far smaller than that of the data-based one.
In addition, the method innovatively exploits the tree structure of the K-d tree and approximates one high-dimensional eigendecomposition problem by several low-dimensional eigendecomposition problems, thereby directly addressing the large computational cost of the eigendecomposition in the spectral clustering algorithm.
Mathematically, analyzing from the angle of complexity: in the original spectral clustering algorithm, constructing the similarity matrix has complexity O(n^2) and the eigendecomposition has complexity O(n^3). In this method, the preprocessing has complexity O(nd·log n), constructing the similarity matrix has complexity O(m^2), and the eigendecomposition has complexity O(n), where m is the number of data sets, n is the number of data points, and d is the dimensionality of the data. From the viewpoint of complexity, the method therefore shows its advantage in efficiency.
Two examples are given here to show the reliability of the method from the viewpoints of both effectiveness and efficiency.
The first example shows the processing effect of the method on a visualizable artificial data set.
As shown in the artificial data set of FIG. 2:
1. The data set is first processed as described above, with a threshold of 0.001 for the K-d tree, resulting in a total of 126 data sets; the height of the tree structure obtained by the K-d tree is 10.
According to the obtained data set U = {u1, u2, ..., um}, the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, is obtained from the equation h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j and h_ij = 0 otherwise, in which |u_j| denotes the number of data in set u_j. For example, if set u_1 contains four data points, corresponding to the 4th to 7th data in the original data set V, then h_1 = [0 0 0 0.5 0.5 0.5 0.5 0 ... 0]^T, where 0.5 is obtained as 1/sqrt(4) = 0.5.
2. Calculate the similarity matrix W between the sets, using a Gaussian kernel of the form W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2)). The value of the scale parameter σ is chosen here by the adaptive spectral clustering algorithm of document [4]; some sampling points are selected from the sets, and the degree of similarity of two sets is measured by the Euclidean distances between the sampling points.
3. Select the layer with depth 4 as the initial layer of the K-d tree, traverse each node of that layer in turn, and perform the following operations on each node:
3.1 Compute the similarity matrix W_S of the node; W_S is a principal submatrix of W, obtained by directly selecting the corresponding rows and columns from W, where m1 is the number of sets contained in the node.
3.2 Based on the formula W~ = Q_l^T · W_S · Q_l, obtain a lower-dimensional similarity matrix W~; for the initial layer, Q_l is the identity matrix of size m1 × m1.
3.3 Sum the rows of W~ to obtain the degree matrix D~, with D~_ii = Σ_j W~_ij.
3.4 Compute the Laplacian matrix L~ = D~^(-1/2) · W~ · D~^(-1/2).
3.5 Compute the first 3 eigenvectors of the Laplacian matrix to obtain an eigenvector matrix E of size m1 × 3; convert the eigenvectors back to the original set space via E~ = Q_l · E, and add the result as the corresponding block of the parent node's conversion matrix Q_{l-1}.
3.6 After all the nodes in the layer have been computed, let l = l−1 and move to the previous layer of the tree, i.e., the layer with depth 3; repeat step 3 until the layer with depth 0 is reached, obtaining the eigenvector matrix Evector corresponding to the first 3 eigenvalues of layer 0, i.e., the original data layer.
4. Let Y = H × Evector to convert the eigenvectors back to the original data set space.
5. Normalize Y by rows.
6. Taking each row of Y as a data point, cluster the eigenvectors with the Fuzzy C-means algorithm into 3 classes C1, C2, C3.
7. If the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class.
The experimental results are shown in FIG. 3.
this example demonstrates the effectiveness of the method.
As another example, we apply the method to the Skin data set from the UCI repository, a large-scale data set containing 245,057 data points. The data are processed by the same procedure as above; the hardware used is a personal computer configured with an Intel(R) Core(TM) i7-3770 CPU and 8 GB of memory. The running time of the method on this data set is 4.31 seconds, with an accuracy of 73.2%; the traditional spectral clustering algorithm cannot run on this computer because of the large scale of the data.
As another example, for image data, the image shown in FIG. 4 is selected from the Weizmann data set; the image size is 300 × 225.
The image data are formatted so that each datum has the form [pixel abscissa, pixel ordinate, pixel value] for subsequent processing.
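A sketch of this assumed preprocessing (the file name and the grayscale conversion are our assumptions; the patent does not specify how the pixel value is encoded):

```python
import numpy as np
from PIL import Image

# each pixel becomes one datum [pixel abscissa, pixel ordinate, pixel value]
img = np.asarray(Image.open("weizmann_image.png").convert("L"), dtype=float)
ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
V = np.column_stack([xs.ravel(), ys.ravel(), img.ravel()])  # (n, 3) data set
```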
The data set is processed using the same method as above, where the threshold of the K-d tree is 10, the number of classes is 5, and the initial depth is chosen to be 4.
FIG. 5 and FIG. 6 show, respectively, the result after clustering into five classes and the boundary of the object extracted according to the gold standard.
The time taken by the method on the image is 1.56 seconds, whereas the running time of the conventional spectral clustering algorithm on the same picture is 26.38 seconds.
These examples demonstrate that the classification results of the method are good and its running efficiency is very high, which broadens the application range of spectral clustering.
The above-mentioned embodiments are merely preferred embodiments used to fully illustrate the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within the protection scope of the invention. The protection scope of the invention is defined by the claims.

Claims (4)

1. A fast spectral clustering method based on a multi-scale data structure, characterized by comprising the following steps:
Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um}, a conversion matrix H, and a tree structure, wherein n is the number of data points, d is the dimensionality of the data, and m is the number of data sets;
Step 2: calculate a similarity matrix W between the sets, wherein the kernel function used to calculate W is a Gaussian kernel, W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2));
Step 3: select the layer with depth l as the initial layer of the K-d tree, traverse each node of that layer in turn, operate on each node, and finally obtain the eigenvector matrix Evector of the root node;
Step 4: convert the eigenvectors back to the original data space by letting Y = H × Evector;
Step 5: normalize Y by rows;
Step 6: taking each row of Y as a data point, cluster the eigenvectors with the Fuzzy C-means algorithm into T classes C1, C2, ..., CT;
Step 7: if the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class;
wherein the step 1 comprises the following steps:
Step 1.1: construct the root node S0; the data in S0 is the whole data set V; compute the variance J on each dimension of the data set, find the dimension corresponding to the maximum variance maxJ, and denote it maxDim;
Step 1.2: taking the mean value along coordinate axis maxDim as the split point, split the original data set V into two subsets V1 and V2, the split being realized by a hyperplane that passes through the split point and is perpendicular to coordinate axis maxDim; generate left and right child nodes S_1^l and S_2^l of depth l from the root node, the left node corresponding to the data whose coordinate on maxDim is smaller than the split point and the right node to the data whose coordinate on maxDim is larger than the split point;
Step 1.3: repeat the above process for each node, stopping when the maxJ computed at a node is smaller than a certain threshold or the node contains only one datum; all leaf nodes found in this way form the aforementioned data sets; each node is denoted S_i^l, where l represents the depth of the tree and i represents the node index at the current depth;
Step 1.4: from the resulting data set U = {u1, u2, ..., um}, obtain the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, based on the equation h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j and h_ij = 0 otherwise, in which |u_j| denotes the number of data in set u_j;
and the step 3 comprises the following steps:
Step 3.1: compute the similarity matrix W_S of each node, W_S being a principal submatrix of W obtained by directly selecting the corresponding rows and columns from W, where m1 is the number of sets contained in the node;
Step 3.2: based on the formula W~ = Q_l^T · W_S · Q_l, obtain a lower-dimensional similarity matrix W~, where for the initial layer Q_l is the identity matrix of size m1 × m1;
Step 3.3: sum the rows of W~ to obtain the degree matrix D~, with D~_ii = Σ_j W~_ij;
Step 3.4: compute the Laplacian matrix L~ = D~^(-1/2) · W~ · D~^(-1/2);
Step 3.5: compute the first k eigenvectors of the Laplacian matrix to obtain an eigenvector matrix E_l of size m1 × k; convert the eigenvectors back to the original set space via E~ = Q_l · E_l, and add the result as the corresponding block of the parent node's conversion matrix Q_{l-1};
Step 3.6: after all the nodes in the layer have been computed, let l = l−1, move to the previous layer of the tree, and repeat step 3 until the depth of the layer is 0, obtaining the eigenvector matrix Evector corresponding to the first k eigenvalues of layer 0, i.e., the original data layer.
2. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of claim 1 are performed when the program is executed by the processor.
3. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 1.
4. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of claim 1.
CN201910257841.3A 2019-04-01 2019-04-01 Rapid spectral clustering method based on multi-scale data structure Active CN109978066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910257841.3A CN109978066B (en) 2019-04-01 2019-04-01 Rapid spectral clustering method based on multi-scale data structure


Publications (2)

Publication Number Publication Date
CN109978066A CN109978066A (en) 2019-07-05
CN109978066B true CN109978066B (en) 2020-10-30

Family ID: 67082181


Country Status (1)

Country Link
CN (1) CN109978066B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111652734B (en) * 2020-07-13 2021-06-29 深圳橙色魔方信息技术有限公司 Financial information management system based on block chain and big data


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197504B1 (en) * 1999-04-23 2007-03-27 Oracle International Corporation System and method for generating decision trees
CN101211355A (en) * 2006-12-30 2008-07-02 中国科学院计算技术研究所 Image inquiry method based on clustering
US8798357B2 (en) * 2012-07-09 2014-08-05 Microsoft Corporation Image-based localization
CN102774325A (en) * 2012-07-31 2012-11-14 西安交通大学 Rearview reversing auxiliary system and method for forming rearview obstacle images
CN105893389A (en) * 2015-01-26 2016-08-24 阿里巴巴集团控股有限公司 Voice message search method, device and server
CN109299339A (en) * 2018-11-26 2019-02-01 辽宁工程技术大学 A kind of quick Spectral Clustering chosen based on improvement kd tree mark point

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mariano Tepper et al., "Automatically finding clusters in normalized cuts", Pattern Recognition, No. 7, 2011-07-31, pp. 1-2. *
陈旻昕 et al., "Finite element simulation of transport processes in three-dimensional ion channels" (三维离子通道内输运过程的有限元模拟), Proceedings of the Chinese Chemical Society Conference (中国化学会会议论文集), 2014-08-04, pp. 1-2. *

Also Published As

Publication number Publication date
CN109978066A (en) 2019-07-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant