CN109978066B - Rapid spectral clustering method based on multi-scale data structure - Google Patents

Rapid spectral clustering method based on multi-scale data structure

Info

Publication number
CN109978066B
CN109978066B
Authority
CN
China
Prior art keywords: data, node, matrix, sets, layer
Prior art date
Legal status: Active
Application number
CN201910257841.3A
Other languages
Chinese (zh)
Other versions
CN109978066A (en)
Inventor
陈旻昕
张重阳
朱国丰
吴晨健
陈虹
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201910257841.3A priority Critical patent/CN109978066B/en
Publication of CN109978066A publication Critical patent/CN109978066A/en
Application granted granted Critical
Publication of CN109978066B publication Critical patent/CN109978066B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Abstract

The invention discloses a rapid spectral clustering method based on a multi-scale data structure, comprising the following steps. Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um} (where n is the number of data points, d is the dimensionality of the data, and m is the number of data sets) and a tree structure. Step 2: calculate a similarity matrix W between the sets using a Gaussian kernel; in a specific implementation, some sampling points are selected from the sets, and the degree of similarity of two sets is measured by the Euclidean distances between the sampling points. The invention has the beneficial effect that the method innovatively adopts a K-d tree algorithm to obtain a series of data sets and replaces the original similarity matrix constructed from individual data points with a similarity matrix computed between the sets.

Description

Rapid spectral clustering method based on multi-scale data structure
Technical Field
The invention relates to the field of clustering, in particular to a rapid spectral clustering method based on a multi-scale data structure.
Background
From the perspective of machine vision and machine learning, clustering is an unsupervised learning process that classifies data according to their similarity, so that data within the same category are as similar as possible to each other, while data in different categories are as dissimilar as possible. Data clustering is widely applied in medical image segmentation, financial data classification, and other areas. In recent years, with the development of artificial intelligence, machine learning, and computer vision, research on data clustering methods has become particularly important. Current data clustering methods generally fall into the following categories: 1. clustering methods based on partitioning; 2. density-based clustering methods; 3. clustering methods based on graph theory.
The K-means algorithm [1] is representative of partition-based clustering algorithms and, owing to its simple implementation and high efficiency, is one of the most common data clustering methods at present. The algorithm randomly selects k data points as the initial cluster centers, computes the distance between each data point and the cluster centers, and assigns each data point to the nearest center. After all data have been assigned, the center of each cluster is recomputed. If no data point is reassigned to a different center, i.e., the cluster centers no longer change, the algorithm terminates; otherwise the process repeats until this termination condition is met.
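For illustration, a minimal NumPy sketch of this procedure (the random initialization and the absence of empty-cluster handling are simplifications of ours):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=None):
    """Minimal k-means following the procedure described above."""
    rng = rng or np.random.default_rng(0)
    # randomly select k data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every data point to its nearest cluster center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # after all data are assigned, recompute each cluster center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers unchanged: terminate
            break
        centers = new_centers
    return labels
```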
The DBSCAN algorithm [2], as a representative density-based clustering algorithm, can divide sufficiently high-density regions into clusters while effectively avoiding noise interference. The method has two parameters, a search radius r and a minimum point count minPoints. Starting from a randomly selected unvisited data point taken as the initial point, all nearby data points within the search radius r are found. If the number of such points is greater than or equal to minPoints, the current point forms a cluster with its nearby data, and the initial point is marked as visited. All points reachable in this way are then processed recursively by the same method, expanding the cluster. If the number of nearby points is less than minPoints, the point is tentatively marked as noise. Once all points in the cluster are marked as visited, the remaining unvisited data are processed in the same manner, until all data are marked as visited.
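A minimal sketch of this procedure, assuming a scipy k-d tree for the radius queries (the function name and the use of -1 as the noise label are our conventions):

```python
import numpy as np
from scipy.spatial import cKDTree

def dbscan(X, r, min_points):
    """Minimal DBSCAN following the description above; label -1 marks noise."""
    tree = cKDTree(X)
    labels = np.full(len(X), -1)            # tentatively mark everything as noise
    visited = np.zeros(len(X), dtype=bool)
    cluster = -1
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = tree.query_ball_point(X[i], r)   # points within radius r
        if len(neighbors) < min_points:
            continue                         # i stays tentatively noise
        cluster += 1                         # i is a core point: new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                         # expand the cluster
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster          # border point or reclaimed noise
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = tree.query_ball_point(X[j], r)
            if len(j_neighbors) >= min_points:
                seeds.extend(j_neighbors)    # j is also a core point: keep growing
    return labels
```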
The Spectral Clustering algorithm [3] is developed from graph theory; it has stronger adaptability to data distribution and a better clustering effect. The algorithm first constructs a similarity matrix W and a degree matrix D from the input data, builds the Laplacian matrix L = D^(-1/2) · W · D^(-1/2), and computes the eigenvectors f corresponding to the first k eigenvalues of L. The matrix formed by these eigenvectors is normalized by rows, each row of the normalized matrix is taken as a sample, and the samples are clustered with k-means to obtain the final clustering result.
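The classical pipeline just described can be sketched as follows; the n × n similarity matrix and the full eigendecomposition visible here are the two bottlenecks the invention targets (the fixed sigma and the use of scikit-learn's KMeans are our choices):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Classical spectral clustering: O(n^2) memory, O(n^3) eigendecomposition."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))  # similarity matrix
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L = D_inv_sqrt @ W @ D_inv_sqrt          # L = D^(-1/2) W D^(-1/2)
    vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    F = vecs[:, -k:]                         # the k leading eigenvectors of L
    F /= np.linalg.norm(F, axis=1, keepdims=True)  # normalize by rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)  # cluster the rows
```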
The traditional technology has the following technical problems:
Although the K-means algorithm processes data efficiently, it cannot effectively handle non-convex data sets; it can therefore often serve only as a small component of a data processing pipeline and cannot independently complete some data classification tasks.
The DBSCAN algorithm does not handle high-dimensional data well; moreover, if the density distribution of the data is uneven and the distances between clusters differ greatly, the clustering result is poor.
The Spectral Clustering algorithm is one of the most effective clustering algorithms at present; it handles various types of data well and, by the nature of the algorithm, has advantages when processing high-dimensional data. However, it must construct a similarity matrix and solve for that matrix's eigenvectors, and both steps carry a large computational cost, which is unacceptable for today's large-scale data and for images of larger size.
Disclosure of Invention
The technical problem to be solved by the invention is the following: spectral clustering performs excellently in data processing, but its computational cost is a major obstacle. The invention therefore provides a rapid spectral clustering method based on a multi-scale data structure. The method optimizes and improves the two steps of similarity-matrix construction and eigenvector decomposition in the spectral clustering algorithm, effectively improves the running efficiency of the algorithm, and enables spectral clustering to be applied successfully to larger-scale data.
In order to solve the technical problem, the invention provides a fast spectral clustering method based on a multi-scale data structure, which comprises the following steps:
Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um} (where n is the number of data points, d is the dimensionality of the data, and m is the number of data sets), a conversion matrix H, and a tree structure;
Step 2: calculate a similarity matrix W between the sets, wherein the kernel function used to calculate W is a Gaussian kernel, W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2));
Step 3: select the layer at depth l_init as the initial layer of the K-d tree, traverse each node of that layer in turn, operate on each node, and finally obtain the eigenvector matrix Evector of the root node;
Step 4: convert the eigenvectors back to the original data space by letting Y = H × Evector;
Step 5: normalize Y by rows;
Step 6: taking each row of Y as a data point, cluster the eigenvectors with the fuzzy C-means algorithm into k classes C1, C2, ..., Ck;
Step 7: if the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class.
In one embodiment, the step 1 comprises the following steps:
Step 1.1: construct the root node S0; the data in S0 is the whole data set V; compute the variance on each dimension of the data set, find the dimension corresponding to the maximum variance maxV, and denote it maxDim;
Step 1.2: taking the mean value along coordinate axis maxDim as the split point, split the original data set V into two subsets V1 and V2; the split is realized by a hyperplane that passes through the split point and is perpendicular to coordinate axis maxDim; generate left and right child nodes S_1^1 and S_2^1 of depth 1 from the root node; the left node corresponds to the data whose coordinate on maxDim is smaller than the split point, and the right node to the data whose coordinate on maxDim is larger than the split point;
Step 1.3: repeat the above process for each node, stopping when the maxV computed at a node is smaller than a certain threshold or the node contains only one datum; all leaf nodes found in this way form the aforementioned data sets; each node is denoted S_i^l, where l represents the depth of the tree and i represents the node index at the current depth;
Step 1.4: from the resulting data set U = {u1, u2, ..., um}, obtain the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, based on the following equation, in which |u_j| denotes the number of data in set u_j:
h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j, and h_ij = 0 otherwise.
in one embodiment, the step 3 comprises the following steps:
step 3.1, calculating the similarity matrix of each node
Figure BDA0002014323270000045
Figure BDA0002014323270000046
Is a main sub-type of order s of W, i.e.
Figure BDA0002014323270000047
The specific method is to directly select corresponding rows and columns from W, wherein m1Is a node Sl-1The number of middle sets;
step 3.2 is based on the formula
Figure BDA0002014323270000048
Obtaining a similarity matrix with lower dimensionality
Figure BDA0002014323270000049
For the initial layer, this QlIs of size m1×m1The identity matrix of (1);
step 3.3 calculation
Figure BDA00020143232700000410
Of the "row" and "degree of acquisition matrix
Figure BDA00020143232700000411
Step 3.4 calculating Laplacian matrix
Figure BDA00020143232700000412
Step 3.5, calculating the first k eigenvectors of the Laplacian matrix to obtain m1The feature vector e with the size of x k, and the obtained feature vector is processed by a formula
Figure BDA0002014323270000051
The feature vector is converted back to the original space, and the obtained result is added to the father node Sl-1Of the conversion matrix Ql-1The addition method is as follows:
Figure BDA0002014323270000052
and 3.6, after all the nodes in the layer are calculated, making l equal to l-1, switching to the previous layer of the tree, and repeating the step 3 until the depth of the layer is 0 to obtain a feature vector matrix eventor corresponding to the first k feature values of the 0 th layer, namely the original data layer.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of any of the methods.
A processor for running a program, wherein the program when running performs any of the methods.
The invention has the beneficial effects that:
The method innovatively adopts a K-d tree algorithm to obtain a series of data sets and replaces the original similarity matrix constructed from individual data points with a similarity matrix computed between the sets; since the number m of data sets is far smaller than the number n of data points, the dimension of the set-based similarity matrix is also far smaller than that of the data-based one.
In addition, the method innovatively exploits the tree structure of the K-d tree and approximates one high-dimensional eigendecomposition problem by several low-dimensional eigendecomposition problems, thereby directly addressing the large computational cost of the eigendecomposition in the spectral clustering algorithm.
Mathematically, analyzing from the angle of complexity: in the original spectral clustering algorithm, constructing the similarity matrix has complexity O(n^2) and the eigendecomposition has complexity O(n^3). In this method, the preprocessing has complexity O(nd·log n), constructing the similarity matrix has complexity O(m^2), and the eigendecomposition has complexity O(n), where m is the number of data sets, n is the number of data points, and d is the dimensionality of the data. Since m is far smaller than n, the comparison of complexities shows the method's advantage in efficiency.
Drawings
FIG. 1 is a schematic diagram of the tree structure obtained by the K-d tree in the fast spectral clustering method based on a multi-scale data structure of the present invention.
FIG. 2 is a schematic diagram of an artificial data set in the multi-scale data structure-based fast spectral clustering method of the present invention.
FIG. 3 is a cluster structure diagram of the fast spectral clustering method based on multi-scale data structure.
FIG. 4 is a schematic diagram of a selected image from the Weizmann data set in the multi-scale data structure-based fast spectral clustering method according to the present invention.
FIG. 5 shows the result of the fast spectral clustering method based on a multi-scale data structure of the present invention after clustering into five classes.
FIG. 6 is the boundary of an object extracted according to the gold standard in the multi-scale data structure-based fast spectral clustering method of the present invention.
Detailed Description
The present invention is further described below in conjunction with the drawings and specific examples, so that those skilled in the art can better understand and practice it; the examples, however, are not intended to limit the present invention.
Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um}, where n is the number of data points, d is the dimensionality of the data, and m is the number of data sets, together with a tree structure. The specific steps are as follows:
Step 1.1: construct the root node S0; the data in S0 is the whole data set V; compute the variance on each dimension of the data set, find the dimension corresponding to the maximum variance maxV, and denote it maxDim (the greater the variance, the lower the coupling between the data and the smaller the similarity between them).
Step 1.2: taking the mean value along coordinate axis maxDim as the split point, split the original data set V into two subsets V1 and V2 (the set may also be split into more subsets; two subsets are taken as an example here); the split is realized by a hyperplane that passes through the split point and is perpendicular to coordinate axis maxDim. Generate left and right child nodes S_1^1 and S_2^1 of depth 1 from the root node; the left node corresponds to the data whose coordinate on maxDim is smaller than the split point, and the right node to the data whose coordinate on maxDim is larger than the split point.
Step 1.3: repeat the above process for each node, stopping when the maxV computed at a node is smaller than a certain threshold or the node contains only one datum. All leaf nodes found in this way form the aforementioned data sets; the coupling within each data set is high. A tree structure as shown in FIG. 1 is thereby obtained. Each node is denoted S_i^l, where l represents the depth of the tree and i represents the node index at the current depth.
Step 1.4: from the resulting data set U = {u1, u2, ..., um}, obtain the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, based on the following equation, in which |u_j| denotes the number of data in set u_j:
h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j, and h_ij = 0 otherwise.
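Steps 1.1–1.4 can be sketched as follows; for brevity the sketch returns only the leaf data sets and the transform matrix H, not the full tree structure that step 3 later traverses, and the guard against degenerate splits is an implementation detail of ours:

```python
import numpy as np

def kdtree_preprocess(V, threshold):
    """Split V (n x d) recursively along the axis of maximum variance."""
    leaves = []

    def split(idx):
        var = V[idx].var(axis=0)            # variance on each dimension
        max_dim = int(np.argmax(var))       # dimension of maximum variance
        if var[max_dim] < threshold or len(idx) <= 1:
            leaves.append(idx)              # leaf node: one data set u_j
            return
        cut = V[idx, max_dim].mean()        # split point: mean along max_dim
        left = idx[V[idx, max_dim] <= cut]
        right = idx[V[idx, max_dim] > cut]
        if len(left) == 0 or len(right) == 0:   # degenerate split guard (ours)
            leaves.append(idx)
            return
        split(left)
        split(right)

    split(np.arange(len(V)))
    n, m = len(V), len(leaves)
    H = np.zeros((n, m))                    # h_ij = 1/sqrt(|u_j|) if v_i in u_j
    for j, u in enumerate(leaves):
        H[u, j] = 1.0 / np.sqrt(len(u))
    return leaves, H
```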
Step 2: and calculating a similarity matrix W between the sets. Wherein the kernel function used to calculate W is
a Gaussian kernel, W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2)). In a specific implementation, some sampling points are selected from the sets, and the degree of similarity of two sets is measured by the Euclidean distances between the sampling points.
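A sketch of this step under stated assumptions: the patent specifies only that Euclidean distances between sampled points measure the similarity of two sets, so the sample size and the use of the mean pairwise distance below are illustrative choices of ours, and sigma is assumed to be given (the worked example later chooses it adaptively following [4]):

```python
import numpy as np
from scipy.spatial.distance import cdist

def set_similarity(V, leaves, sigma, n_samples=3, rng=None):
    """Set-level similarity matrix W (m x m) from a few sampled points per set."""
    rng = rng or np.random.default_rng(0)
    samples = [V[rng.choice(u, size=min(n_samples, len(u)), replace=False)]
               for u in leaves]
    m = len(leaves)
    W = np.ones((m, m))                     # self-similarity is 1
    for i in range(m):
        for j in range(i + 1, m):
            d = cdist(samples[i], samples[j]).mean()   # set-to-set distance
            W[i, j] = W[j, i] = np.exp(-d**2 / (2 * sigma**2))
    return W
```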
Step 3: select the layer at depth l_init as the initial layer of the K-d tree, traverse each node of that layer in turn, and perform the following operations on each node:
Step 3.1: compute the similarity matrix W_S of the node;
W_S is a principal submatrix of W, obtained by directly selecting the corresponding rows and columns from W, where m1 is the number of sets contained in the node.
Step 3.2: based on the formula W~ = Q_l^T · W_S · Q_l, obtain a lower-dimensional similarity matrix W~; for the initial layer, Q_l is the identity matrix of size m1 × m1.
Step 3.3: sum the rows of W~ to obtain the degree matrix D~, with D~_ii = Σ_j W~_ij.
Step 3.4: compute the Laplacian matrix L~ = D~^(-1/2) · W~ · D~^(-1/2).
Step 3.5: compute the first k eigenvectors of the Laplacian matrix to obtain an eigenvector matrix E of size m1 × k; convert the eigenvectors back to the original set space via E~ = Q_l · E, and add the result as the corresponding block of the parent node's conversion matrix Q_{l-1}.
Step 3.6: after all the nodes in the layer have been computed, let l = l−1, move to the previous layer of the tree, and repeat step 3 until the depth of the layer is 0, obtaining the eigenvector matrix Evector corresponding to the first k eigenvalues of layer 0, i.e., the original data layer.
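The per-node computation of steps 3.1–3.5 can be sketched as follows; the function and variable names are ours, and the block-wise assembly of the parent's conversion matrix described after the code is our reading of the patent figures rather than a stated formula:

```python
import numpy as np

def process_node(W, set_indices, Q, k):
    """One per-node pass of steps 3.1-3.5 (a sketch; names are ours).

    W           -- full m x m set-level similarity matrix
    set_indices -- indices of the sets belonging to this node (m1 of them)
    Q           -- this node's conversion matrix (m1 x q); identity at the
                   initial layer
    k           -- number of leading eigenvectors to keep
    """
    W_S = W[np.ix_(set_indices, set_indices)]   # 3.1: principal submatrix
    W_t = Q.T @ W_S @ Q                         # 3.2: reduced similarity
    d = np.maximum(W_t.sum(axis=1), 1e-12)      # 3.3: row sums give degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ W_t @ D_inv_sqrt           # 3.4: normalized Laplacian
    vals, vecs = np.linalg.eigh(L)              # eigenvalues ascending
    E = vecs[:, -k:]                            # 3.5: k leading eigenvectors
    return Q @ E                                # converted back to set space
```

Under this reading, a parent whose children return blocks of size m_left × k and m_right × k assembles a conversion matrix Q_{l-1} of size m1 × 2k by placing each block on the rows of that child's sets and zeros elsewhere, so every eigendecomposition along the tree stays low-dimensional.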
Step 4: let Y = H × Evector to convert the eigenvectors back to the original data set space.
Step 5: normalize Y by rows.
Step 6: taking each row of Y as a data point, cluster the eigenvectors with the Fuzzy C-means algorithm into k classes C1, C2, ..., Ck.
Step 7: if the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class.
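Steps 4–7 can be sketched as follows, assuming H and Evector from the preceding stages; the fuzzy C-means loop is a standard minimal implementation in which the fuzzifier m = 2 and the tolerance are choices of ours, not values taken from the patent:

```python
import numpy as np
from scipy.spatial.distance import cdist

def fuzzy_cmeans(Y, k, m=2.0, iters=100, tol=1e-5, rng=None):
    """Minimal fuzzy C-means on the row-normalized embedding Y (steps 6-7)."""
    rng = rng or np.random.default_rng(0)
    U = rng.random((len(Y), k))
    U /= U.sum(axis=1, keepdims=True)             # random fuzzy memberships
    for _ in range(iters):
        Um = U ** m
        C = (Um.T @ Y) / Um.sum(axis=0)[:, None]  # fuzzy cluster centers
        D = np.maximum(cdist(Y, C), 1e-12)        # point-to-center distances
        U_new = D ** (-2.0 / (m - 1))
        U_new /= U_new.sum(axis=1, keepdims=True) # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U.argmax(axis=1)                       # hard label per row (step 7)

# Y = H @ Evector                                 # step 4: back to data space
# Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # step 5: row-normalize
# labels = fuzzy_cmeans(Y, k=3)                   # steps 6-7
```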
It can be seen from the foregoing technical solution that the method innovatively adopts a K-d tree algorithm to obtain a series of data sets and replaces the original similarity matrix constructed from individual data points with a similarity matrix computed between the sets; since the number m of data sets is far smaller than the number n of data points, the dimension of the set-based similarity matrix is also far smaller than that of the data-based one.
In addition, the method innovatively exploits the tree structure of the K-d tree and approximates one high-dimensional eigendecomposition problem by several low-dimensional eigendecomposition problems, thereby directly addressing the large computational cost of the eigendecomposition in the spectral clustering algorithm.
Mathematically, analyzing from the angle of complexity: in the original spectral clustering algorithm, constructing the similarity matrix has complexity O(n^2) and the eigendecomposition has complexity O(n^3). In this method, the preprocessing has complexity O(nd·log n), constructing the similarity matrix has complexity O(m^2), and the eigendecomposition has complexity O(n), where m is the number of data sets, n is the number of data points, and d is the dimensionality of the data. From the viewpoint of complexity, the method therefore shows its advantage in efficiency.
Two examples are given here to show the reliability of the method from the viewpoints of both effectiveness and efficiency.
The first example shows the processing effect of the method on a visualizable artificial data set.
As shown in the artificial data set of FIG. 2:
1. The data set is first processed as described above, with a threshold of 0.001 for the K-d tree, resulting in a total of 126 data sets; the height of the tree structure obtained by the K-d tree is 10.
According to the obtained data set U = {u1, u2, ..., um}, the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, is obtained from the equation h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j and h_ij = 0 otherwise, in which |u_j| denotes the number of data in set u_j. For example, if set u_1 contains four data points, corresponding to the 4th to 7th data in the original data set V, then h_1 = [0 0 0 0.5 0.5 0.5 0.5 0 ... 0]^T, where 0.5 is obtained as 1/sqrt(4) = 0.5.
2. Calculate the similarity matrix W between the sets, using a Gaussian kernel of the form W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2)). The value of the scale parameter σ is chosen here by the adaptive spectral clustering algorithm of document [4]; some sampling points are selected from the sets, and the degree of similarity of two sets is measured by the Euclidean distances between the sampling points.
3. Select the layer with depth 4 as the initial layer of the K-d tree, traverse each node of that layer in turn, and perform the following operations on each node:
3.1 Compute the similarity matrix W_S of the node; W_S is a principal submatrix of W, obtained by directly selecting the corresponding rows and columns from W, where m1 is the number of sets contained in the node.
3.2 Based on the formula W~ = Q_l^T · W_S · Q_l, obtain a lower-dimensional similarity matrix W~; for the initial layer, Q_l is the identity matrix of size m1 × m1.
3.3 Sum the rows of W~ to obtain the degree matrix D~, with D~_ii = Σ_j W~_ij.
3.4 Compute the Laplacian matrix L~ = D~^(-1/2) · W~ · D~^(-1/2).
3.5 Compute the first 3 eigenvectors of the Laplacian matrix to obtain an eigenvector matrix E of size m1 × 3; convert the eigenvectors back to the original set space via E~ = Q_l · E, and add the result as the corresponding block of the parent node's conversion matrix Q_{l-1}.
3.6 After all the nodes in the layer have been computed, let l = l−1 and move to the previous layer of the tree, i.e., the layer with depth 3; repeat step 3 until the layer with depth 0 is reached, obtaining the eigenvector matrix Evector corresponding to the first 3 eigenvalues of layer 0, i.e., the original data layer.
4. Let Y = H × Evector to convert the eigenvectors back to the original data set space.
5. Normalize Y by rows.
6. Taking each row of Y as a data point, cluster the eigenvectors with the Fuzzy C-means algorithm into 3 classes C1, C2, C3.
7. If the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class.
The experimental results are shown in FIG. 3.
this example demonstrates the effectiveness of the method.
As another example, we apply the method to the Skin data set from the UCI repository, a large-scale data set containing 245,057 data points. The data are processed by the same procedure as above; the hardware used is a personal computer configured with an Intel(R) Core(TM) i7-3770 CPU and 8 GB of memory. The running time of the method on this data set is 4.31 seconds, with an accuracy of 73.2%; the traditional spectral clustering algorithm cannot run on this computer because of the large scale of the data.
As another example, for image data, the image shown in FIG. 4 is selected from the Weizmann data set; the image size is 300 × 225.
The image data are formatted so that each datum has the form [pixel abscissa, pixel ordinate, pixel value] for subsequent processing.
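A sketch of this assumed preprocessing (the file name and the grayscale conversion are our assumptions; the patent does not specify how the pixel value is encoded):

```python
import numpy as np
from PIL import Image

# each pixel becomes one datum [pixel abscissa, pixel ordinate, pixel value]
img = np.asarray(Image.open("weizmann_image.png").convert("L"), dtype=float)
ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
V = np.column_stack([xs.ravel(), ys.ravel(), img.ravel()])  # (n, 3) data set
```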
The data set is processed using the same method as above, where the threshold of the K-d tree is 10, the number of classes is 5, and the initial depth is chosen to be 4.
FIG. 5 and FIG. 6 show, respectively, the result after clustering into five classes and the boundary of the object extracted according to the gold standard.
The time taken by the method on the image is 1.56 seconds, whereas the running time of the conventional spectral clustering algorithm on the same picture is 26.38 seconds.
These examples demonstrate that the classification results of the method are good and its running efficiency is very high, which broadens the application range of spectral clustering.
The above-mentioned embodiments are merely preferred embodiments used to fully illustrate the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within the protection scope of the invention. The protection scope of the invention is defined by the claims.

Claims (4)

1. A fast spectral clustering method based on a multi-scale data structure, characterized by comprising the following steps:
Step 1: for input d-dimensional spatial data V = {v1, v2, ..., vn}, where vi ∈ R^d, preprocess the data with a K-d tree algorithm to obtain a series of data sets U = {u1, u2, ..., um}, a conversion matrix H, and a tree structure, wherein n is the number of data points, d is the dimensionality of the data, and m is the number of data sets;
Step 2: calculate a similarity matrix W between the sets, wherein the kernel function used to calculate W is a Gaussian kernel, W_ij = exp(-dist(u_i, u_j)^2 / (2σ^2));
Step 3: select the layer with depth l as the initial layer of the K-d tree, traverse each node of that layer in turn, operate on each node, and finally obtain the eigenvector matrix Evector of the root node;
Step 4: convert the eigenvectors back to the original data space by letting Y = H × Evector;
Step 5: normalize Y by rows;
Step 6: taking each row of Y as a data point, cluster the eigenvectors with the Fuzzy C-means algorithm into T classes C1, C2, ..., CT;
Step 7: if the i-th row of Y belongs to the j-th class, assign the original data point vi to the j-th class;
wherein the step 1 comprises the following steps:
Step 1.1: construct the root node S0; the data in S0 is the whole data set V; compute the variance J on each dimension of the data set, find the dimension corresponding to the maximum variance maxJ, and denote it maxDim;
Step 1.2: taking the mean value along coordinate axis maxDim as the split point, split the original data set V into two subsets V1 and V2, the split being realized by a hyperplane that passes through the split point and is perpendicular to coordinate axis maxDim; generate left and right child nodes S_1^l and S_2^l of depth l from the root node, the left node corresponding to the data whose coordinate on maxDim is smaller than the split point and the right node to the data whose coordinate on maxDim is larger than the split point;
Step 1.3: repeat the above process for each node, stopping when the maxJ computed at a node is smaller than a certain threshold or the node contains only one datum; all leaf nodes found in this way form the aforementioned data sets; each node is denoted S_i^l, where l represents the depth of the tree and i represents the node index at the current depth;
Step 1.4: from the resulting data set U = {u1, u2, ..., um}, obtain the transform matrix H = {h1, h2, ..., hm}, where hi = {h_1i, h_2i, ..., h_ni}, based on the equation h_ij = 1/sqrt(|u_j|) if v_i ∈ u_j and h_ij = 0 otherwise, in which |u_j| denotes the number of data in set u_j;
and the step 3 comprises the following steps:
Step 3.1: compute the similarity matrix W_S of each node, W_S being a principal submatrix of W obtained by directly selecting the corresponding rows and columns from W, where m1 is the number of sets contained in the node;
Step 3.2: based on the formula W~ = Q_l^T · W_S · Q_l, obtain a lower-dimensional similarity matrix W~, where for the initial layer Q_l is the identity matrix of size m1 × m1;
Step 3.3: sum the rows of W~ to obtain the degree matrix D~, with D~_ii = Σ_j W~_ij;
Step 3.4: compute the Laplacian matrix L~ = D~^(-1/2) · W~ · D~^(-1/2);
Step 3.5: compute the first k eigenvectors of the Laplacian matrix to obtain an eigenvector matrix E_l of size m1 × k; convert the eigenvectors back to the original set space via E~ = Q_l · E_l, and add the result as the corresponding block of the parent node's conversion matrix Q_{l-1};
Step 3.6: after all the nodes in the layer have been computed, let l = l−1, move to the previous layer of the tree, and repeat step 3 until the depth of the layer is 0, obtaining the eigenvector matrix Evector corresponding to the first k eigenvalues of layer 0, i.e., the original data layer.
2. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of claim 1 are performed when the program is executed by the processor.
3. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 1.
4. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of claim 1.
CN201910257841.3A 2019-04-01 2019-04-01 Rapid spectral clustering method based on multi-scale data structure Active CN109978066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910257841.3A CN109978066B (en) 2019-04-01 2019-04-01 Rapid spectral clustering method based on multi-scale data structure


Publications (2)

Publication Number Publication Date
CN109978066A CN109978066A (en) 2019-07-05
CN109978066B true CN109978066B (en) 2020-10-30

Family ID: 67082181


Country Status (1)

Country Link
CN (1) CN109978066B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111652734B (en) * 2020-07-13 2021-06-29 深圳橙色魔方信息技术有限公司 Financial information management system based on block chain and big data


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197504B1 (en) * 1999-04-23 2007-03-27 Oracle International Corporation System and method for generating decision trees
CN101211355A (en) * 2006-12-30 2008-07-02 中国科学院计算技术研究所 Image inquiry method based on clustering
US8798357B2 (en) * 2012-07-09 2014-08-05 Microsoft Corporation Image-based localization
CN102774325A (en) * 2012-07-31 2012-11-14 西安交通大学 Rearview reversing auxiliary system and method for forming rearview obstacle images
CN105893389A (en) * 2015-01-26 2016-08-24 阿里巴巴集团控股有限公司 Voice message search method, device and server
CN109299339A (en) * 2018-11-26 2019-02-01 辽宁工程技术大学 A kind of quick Spectral Clustering chosen based on improvement kd tree mark point

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mariano Tepper et al., "Automatically finding clusters in normalized cuts", Pattern Recognition, No. 7, 2011-07-31, pp. 1-2. *
陈旻昕 et al., "Finite element simulation of transport processes in three-dimensional ion channels" (三维离子通道内输运过程的有限元模拟), Proceedings of the Chinese Chemical Society Conference (中国化学会会议论文集), 2014-08-04, pp. 1-2. *

Also Published As

Publication number Publication date
CN109978066A (en) 2019-07-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant