CN105160352A - High-dimensional data subspace clustering projection effect optimization method based on dimension reconstitution - Google Patents
- Publication number
- CN105160352A (application CN201510504284.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
Abstract
The present invention provides a method for optimizing the clustering projection effect of high-dimensional data subspaces based on dimension reconstruction. The method comprises the following steps: 1, exploring the dimension subspaces, i.e. identifying a target optimization subspace whose two-dimensional projection effect needs to be improved and selecting subspaces with a good clustering structure; 2, constructing a reconstructed dimension set, i.e. transferring the clustering information of the subspaces with a good clustering structure into reconstructed dimensions; 3, constructing a candidate optimized dimension subspace set, i.e. freely combining the elements of the reconstructed dimension set with the target optimization dimension subspace to generate the candidate optimized subspaces; 4, screening out an optimized dimension subspace set; 5, determining an optimal dimension subspace. The method introduces the concept of reconstruction into high-dimensional data subspaces: reconstructed dimensions carrying stronger clustering information improve the clustering projection effect of the target optimization subspace, which addresses the distortion of the clustering projection of a high-dimensional data subspace on the two-dimensional plane.
Description
Technical Field
The invention relates to the technical field of high-dimensional data analysis, processing and visualization, and in particular to a method for optimizing the subspace clustering projection effect using concepts and techniques such as subspace clustering, LDA (linear discriminant analysis), MDS (multidimensional scaling) and the Dunn index.
Background
With the rapid development of computer technology across industries, the volume of data keeps growing, and much of it is multidimensional or even high-dimensional. Given the limits of human cognitive ability and the difficulty of imagining a high-dimensional data space, it remains hard for people to extract the deep information embedded in complex high-dimensional data. The ideal result of a clustering method is that objects in the same cluster are as similar as possible, while objects in different clusters are as dissimilar as possible. Through cluster analysis, people can discover knowledge hidden in the data even when the regularities in a large data set are not known in advance. When the dimensionality of the data is high, the noise level increases, the data become sparse, distance-based metrics lose their effectiveness, and other adverse effects appear; these anomalies that emerge as dimensionality grows are called the "curse of dimensionality". How to process high-dimensional data has therefore attracted wide attention and become an active research problem.
Visualization technology is a visual perception technology that helps us understand data. The Oxford English Dictionary defines visualization as "the ability or process to compose a mental context, or the vision of something that is not directly perceptible". The term also refers to the process of generating a visible image of something that would otherwise be invisible. It has been pointed out that visualization is a series of transformations that convert raw simulation data into a displayable image, the purpose of which is to convert the information into a format that can be perceived by the human sensory system. Visualization technology is now applied in essentially all fields of scientific research, and is an emerging discipline that spans computer graphics, signal processing, human-computer interaction, artificial intelligence and other fields. Applying the theory and methods of visualization to pattern recognition allows human flexibility and creativity to be brought fully into play. Visualization can act as an intermediary between abstract data and the user: it gives the user an overall view of the data and helps the user identify content of interest.
A subspace is a data set formed from a subset of the dimensions of the original high-dimensional data set; the dimensions of different subspaces may partially overlap or be completely different. Clustering is a common data analysis tool that aims to divide a large collection of data points into several classes, so that the data within each class are maximally similar while the data in different classes are maximally different. The data to be clustered are often very high-dimensional, reaching hundreds or even thousands of dimensions, and clustering in such a high-dimensional space is a challenging problem. There are three main reasons: 1) clustering is by nature an unsupervised learning problem, so many supervised learning algorithms cannot be used; 2) in such a high-dimensional space, the distance between instances is dominated by a large number of irrelevant attributes, which may place instances with close values of the relevant attributes far apart and lead to poor clustering results; 3) because of the curse of dimensionality, the higher the dimensionality, the greater the computational cost. To address these problems, a dimension reduction method is usually adopted to map the high-dimensional data to a low-dimensional space for subspace selection, and clustering is then performed on the subspace. Commonly used dimensionality reduction methods include Principal Component Analysis (PCA), Multidimensional Scaling (MDS), Locally Linear Embedding (LLE), Linear Discriminant Analysis (LDA), and the like. The invention uses two of these dimensionality reduction techniques, MDS and LDA, and applies them in different combinations to the high-dimensional data set.
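The Dunn separability index is a clustering validity function based on geometric structure. Dunn defined this hard-clustering validity function in terms of the compactness and separation of a data set:

$$D = \min_{1 \le i \le c} \left\{ \min_{1 \le j \le c,\ j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le c} \operatorname{diam}(C_k)} \right\} \right\}$$

where $c$ is the number of clusters, $\delta(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ defines the distance between class $C_i$ and class $C_j$, and $\operatorname{diam}(C_k) = \max_{x, y \in C_k} d(x, y)$ defines the diameter of cluster $C_k$. A large Dunn index value indicates that the data set contains compact, well-separated clusters.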
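As a minimal illustration of this definition, the Dunn index of a labelled point set can be computed as sketched below. The function name `dunn_index` and the use of Euclidean distance are assumptions made for the example; the text above does not fix a particular distance.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: smallest between-cluster distance / largest cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # diam(C_k): largest pairwise distance inside a cluster (0 for singletons).
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # delta(C_i, C_j): smallest distance between points of two different clusters.
    min_separation = min(
        math.dist(a, b)
        for ci, cj in combinations(clusters.values(), 2)
        for a in ci
        for b in cj
    )
    # ASSUMPTION: if every cluster is a single point, report infinity.
    return math.inf if max_diameter == 0 else min_separation / max_diameter


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
    print(dunn_index(pts, [0, 0, 1, 1]))
```

In the usage example both clusters have diameter 1 and the closest points of the two clusters are 5 apart, so a tight, well-separated layout yields a large index.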
Disclosure of Invention
The main objective of the invention is to optimize the clustering projection effect of a high-dimensional data subspace (dimension subspace). For cases in which the projection effect of a high-dimensional data subspace is unsatisfactory, the invention provides a method that can improve it.
The design idea of the invention is as follows: based on the idea of dimension reconstruction and using the two dimension reduction techniques MDS and LDA, the necessary clustering information is collected from the original high-dimensional data subspaces and constructed into new dimensions. These new dimensions are introduced into the original high-dimensional data subspace to form an optimized subspace that carries stronger clustering information, so that the clustering projection effect of the optimized subspace is better than that of the original subspace.
To achieve this technical purpose, the technical scheme of the invention is a high-dimensional data subspace clustering projection effect optimization method based on dimension reconstruction, which comprises the following steps:
step 1): exploring the dimension subspaces: from the original data set, select a target optimization dimension subspace with poor clustering structure information (i.e. a poor two-dimensional projection effect) and several dimension subspaces with good clustering structure information (i.e. a good projection effect);
step 2): constructing a reconstruction dimension set: project the data objects of the dimension subspaces with good clustering structure obtained in step 1) onto a two-dimensional plane with the CMDS algorithm; then process the data point set on each two-dimensional plane with the LDA algorithm and construct the corresponding discrimination line; finally, project the data points on each two-dimensional plane onto the corresponding discrimination line. The set of projection values of all data points on one discrimination line forms a reconstruction dimension, and all reconstruction dimensions together form the reconstruction dimension set;
step 3): constructing a candidate optimization dimension subspace set: freely combine one or more reconstruction dimensions from the reconstruction dimension set obtained in step 2) with the target optimization dimension subspace; all such combinations form the candidate optimization dimension subspace set;
step 4): screening out an optimized dimension subspace set: for each candidate optimization dimension subspace in the set obtained in step 3), first project its data objects onto a two-dimensional plane with the CMDS algorithm; then apply the K-means clustering algorithm to the data points on the two-dimensional plane; finally, calculate the Dunn index of the clustering result, and if the Dunn index is larger than a preset threshold Q, accept the candidate as an optimized dimension subspace;
step 5): determining an optimal dimension subspace: if the optimized dimension subspace set determined in step 4) is empty, no optimal dimension subspace exists; otherwise, select from the set the optimized dimension subspace with the maximum Dunn index as the optimal dimension subspace.
In the above method, step 1) comprises the following steps:
step 1.1): calculating the dimension point set of the original data set on a two-dimensional plane: first calculate the Pearson correlation coefficients between all dimensions of the original data set; then process the resulting Pearson correlation coefficients with the CMDS algorithm so that the dimension objects of the original data set are projected onto a two-dimensional plane;
step 1.2): selecting a target optimization dimension subspace: apply the K-means clustering algorithm to the dimension point set on the two-dimensional plane obtained in step 1.1) to obtain K clusters of dimension points; then take these K clusters as K candidate dimension subspaces and process the data objects of each candidate dimension subspace, one by one, with the CMDS algorithm; finally, calculate the Dunn index of the data points of each of the K candidate dimension subspaces on the two-dimensional plane, and select the candidate dimension subspace with the minimum Dunn index as the target optimization subspace, provided that its Dunn index is smaller than the threshold N; otherwise there is no target optimization subspace and the whole projection effect optimization process ends;
step 1.3): selecting several dimension subspaces with good clustering structure: using the Dunn indexes of the K candidate dimension subspaces calculated in step 1.2), first set a dimension subspace threshold W, which is used to screen out the dimension subspaces whose clustering structure information is good enough for the user's needs (the larger the value of W, the stronger the clustering structure information of the selected subspaces); then select all candidate dimension subspaces with Dunn > W as the dimension subspaces with good clustering structure.
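Step 1.1) can be sketched as follows, assuming NumPy is available. Two choices in the sketch are assumptions, not statements from the text: the correlation matrix is turned into a dissimilarity as 1 - r, and CMDS is implemented as classical (Torgerson) MDS by eigendecomposition. The function names `classical_mds` and `dimension_points` are hypothetical.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS of a symmetric distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]   # indices of the largest ones
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def dimension_points(data):
    """Place every column (dimension) of `data` on a two-dimensional plane."""
    corr = np.corrcoef(data, rowvar=False)  # Pearson correlation between dimensions
    dist = 1.0 - corr                       # ASSUMPTION: dissimilarity = 1 - r
    return classical_mds(dist, 2)


if __name__ == "__main__":
    t = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
    a, b = np.sin(t), np.cos(t)
    data = np.column_stack([a, 2 * a + 1, b, 3 * b - 2])
    print(dimension_points(data))
```

With this dissimilarity, strongly correlated dimensions land close together on the plane, so the K clusters found by K-means in step 1.2) group dimensions that vary together.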
In the above method, step 2) comprises the following steps:
step 2.1): selecting a dimension subspace with good clustering structure: select a not yet processed dimension subspace from the dimension subspaces with good clustering structure obtained in step 1.3);
step 2.2): calculating the two-dimensional data points: process all data objects of the dimension subspace selected in step 2.1) with the CMDS algorithm, normalize the result, and thereby project the data objects onto a two-dimensional plane;
step 2.3): calculating the discrimination line: process the two-dimensional data points calculated in step 2.2) with the LDA algorithm to obtain the discrimination line corresponding to the data points of this dimension subspace;
step 2.4): calculating the projection values: using the two-dimensional plane data points calculated in step 2.2) and the discrimination line calculated in step 2.3), project the data points one by one onto the discrimination line and compute each projection value according to projection formula (1). Because the x and y values of the data points obtained in step 2.2) lie in [-1, 1], the discrimination line can be represented by the segment cut from it by the square centred at (0, 0) with side length 2 and sides parallel or perpendicular to the coordinate axes. In formula (1), O denotes the midpoint of this discrimination line segment, M denotes any data point on the two-dimensional plane, M' denotes the projection of M onto the discrimination line, and $M_{rd}$ is the projection value of the data point M on the discrimination line;
step 2.5): normalizing the projection values: the sequence of projection values on each discrimination line obtained in step 2.4) is standardized by

$$z_{id} = \frac{x_{id} - \mu_d}{\sigma_d}$$

where $x_{id}$ is a single projection value, $\mu_d$ is the mean of the list of projection values, $\sigma_d$ is the standard deviation of the list of projection values, and $z_{id}$ denotes the standardized value;
step 2.6): constructing the reconstruction dimension: take the standardized projection values of the two-dimensional plane data points on the discrimination line calculated in step 2.5) and order them according to the original order of the two-dimensional data points; the resulting sequence is the reconstruction dimension. Then jump to step 2.1) and repeat until all dimension subspaces with good clustering structure have been processed.
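Steps 2.3) to 2.6) can be sketched as below, assuming NumPy. Several points are assumptions of the sketch rather than statements from the text: the class labels passed to LDA are taken as given (for example from a prior clustering), only the two-class Fisher discriminant is handled, and the projection value is taken as the signed position of each point along the unit direction of the discrimination line. Because step 2.5) subtracts the mean, shifting the origin along the line does not change the standardized result. The function name `reconstructed_dimension` is hypothetical.

```python
import numpy as np


def reconstructed_dimension(points, labels):
    """Project 2-D points onto a two-class Fisher discriminant line and z-score them."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    a = points[labels == classes[0]]
    b = points[labels == classes[1]]

    # Within-class scatter matrix: sum of the scatter matrices of both classes.
    dev_a, dev_b = a - a.mean(axis=0), b - b.mean(axis=0)
    s_w = dev_a.T @ dev_a + dev_b.T @ dev_b

    # Fisher direction w ~ S_w^{-1} (mean_b - mean_a); pinv tolerates a singular S_w.
    w = np.linalg.pinv(s_w) @ (b.mean(axis=0) - a.mean(axis=0))
    w /= np.linalg.norm(w)

    proj = points @ w                          # signed position along the line
    return (proj - proj.mean()) / proj.std()   # step 2.5): (x - mu) / sigma


if __name__ == "__main__":
    pts = [(-1, 0.1), (-1, -0.1), (-0.8, 0), (1, 0.1), (1, -0.1), (0.8, 0)]
    print(reconstructed_dimension(pts, [0, 0, 0, 1, 1, 1]))
```

The returned values keep the order of the input points, which is what step 2.6) requires for the sequence to serve as a new dimension alongside the original ones.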
In the above method, step 3) comprises the following steps:
step 3.1): from the reconstruction dimension set determined in step 2), compute the set $C_{rd}$ of all its non-empty subsets;
step 3.2): take the set $D_{od}$ of target optimization dimension subspaces determined in step 1);
step 3.3): compute the Cartesian product $C_{rd} \times D_{od}$; the resulting set is the candidate optimization subspace set $K_{cos}$.
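The set construction of step 3) can be sketched with the standard library. The function names are hypothetical, and the labels "RD-A", "RD-B", "subspace 3" and "subspace 4" are used purely as placeholder identifiers.

```python
from itertools import combinations, product


def nonempty_subsets(items):
    """C_rd: every non-empty subset of the reconstruction dimension set."""
    items = list(items)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def candidate_subspaces(reconstruction_dims, target_subspaces):
    """K_cos: Cartesian product C_rd x D_od as (subset, target) pairs."""
    c_rd = nonempty_subsets(reconstruction_dims)
    return list(product(c_rd, target_subspaces))


if __name__ == "__main__":
    for subset, target in candidate_subspaces(["RD-A", "RD-B"],
                                              ["subspace 3", "subspace 4"]):
        print(target, "+", sorted(subset))
```

A set of n reconstruction dimensions has 2^n - 1 non-empty subsets, so the number of candidates grows exponentially with n; this is worth keeping in mind when many subspaces pass the threshold W.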
In the above method, step 4) comprises the following steps:
step 4.1): let the optimized dimension subspace set $K_{os} = \varnothing$ and set a threshold Q;
step 4.2): take the candidate optimized dimension subspace set $K_{cos}$ calculated in step 3.3). If $K_{cos} \neq \varnothing$, select a candidate optimized dimension subspace $s_{co} \in K_{cos}$, remove it from the set, i.e. $K_{cos} = K_{cos} - \{s_{co}\}$, and jump to step 4.3); if $K_{cos} = \varnothing$, the optimized dimension subspace set $K_{os}$ is complete, and execution jumps to step 5);
step 4.3): process the candidate optimized dimension subspace $s_{co}$ selected in step 4.2) with the MDS algorithm to obtain a scatter diagram on a two-dimensional plane, apply the K-means clustering algorithm to the data points of the scatter diagram, and calculate the Dunn index of the clustering result. If this Dunn index is greater than Q, add $s_{co}$ to the optimized dimension subspace set, i.e. $K_{os} = K_{os} \cup \{s_{co}\}$; then jump back to step 4.2).
In the above method, step 5) comprises the following step:
from the optimized dimension subspace set $K_{os}$ calculated in step 4), select the optimized dimension subspace with the largest Dunn index as the optimal dimension subspace $s_{mo}$.
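The control flow of steps 4) and 5) can be sketched as a simple loop. The callable `dunn_of` is a hypothetical stand-in for the whole scoring pipeline of step 4.3) (projection, K-means, Dunn index), and the scores in the usage example are made-up illustrative numbers, not results from the patent.

```python
def select_optimal(candidates, dunn_of, q):
    """Screen candidates with threshold q (step 4) and return the best one (step 5)."""
    k_os = []                    # step 4.1): optimized set starts empty
    k_cos = list(candidates)     # working copy of the candidate set
    while k_cos:                 # step 4.2): continue until K_cos is empty
        s_co = k_cos.pop()       # select a candidate and remove it from K_cos
        score = dunn_of(s_co)    # step 4.3): Dunn index of its 2-D clustering
        if score > q:
            k_os.append((s_co, score))
    if not k_os:
        return None              # step 5): no optimal dimension subspace exists
    return max(k_os, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    scores = {"cand-1": 0.2, "cand-2": 0.7, "cand-3": 0.9}  # illustrative values only
    print(select_optimal(scores, scores.get, q=0.5))
```

Returning `None` for an empty optimized set mirrors the case in step 5) where no optimal dimension subspace exists.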
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a diagram of a specific implementation of the method of the present invention;
FIG. 3 is a conceptual illustration of the method of the present invention;
FIG. 4 is a schematic diagram of a dimension reconstruction method according to the method of the present invention;
FIG. 5 is a software tool schematic of the method of the present invention.
Detailed Description
In order that the objects, design considerations and advantages of the present invention will become more apparent, the invention will be further described in detail in the following section, taken in conjunction with the accompanying drawings.
The invention provides a high-dimensional data subspace clustering projection effect optimization method based on dimension reconstruction, which, as shown in FIG. 1, comprises the following five main steps: step 1): exploring the dimension subspaces: from the original data set, select a target optimization dimension subspace with poor clustering structure information (i.e. a poor two-dimensional projection effect) and several dimension subspaces with good clustering structure information (i.e. a good projection effect); step 2): constructing a reconstruction dimension set: project the data objects of the dimension subspaces with good clustering structure obtained in step 1) onto a two-dimensional plane with the CMDS algorithm; then process the data point set on each two-dimensional plane with the LDA algorithm and construct the corresponding discrimination line; finally, project the data points on each two-dimensional plane onto the corresponding discrimination line, so that the set of projection values of all data points on one discrimination line forms a reconstruction dimension and all reconstruction dimensions together form the reconstruction dimension set; step 3): constructing a candidate optimization dimension subspace set: freely combine one or more reconstruction dimensions from the reconstruction dimension set obtained in step 2) with the target optimization dimension subspace, all such combinations forming the candidate optimization dimension subspace set; step 4): screening out an optimized dimension subspace set: for each candidate optimization dimension subspace in the set obtained in step 3), first project its data objects onto a two-dimensional plane with the CMDS algorithm; then apply the K-means clustering algorithm to the data points on the two-dimensional plane; finally, calculate the Dunn index of the clustering result, and if the Dunn index is larger than a preset threshold Q (the threshold Q is determined according to the requirements of the user), accept the candidate as an optimized dimension subspace; step 5): determining an optimal dimension subspace: if the optimized dimension subspace set determined in step 4) is empty, no optimal dimension subspace exists; otherwise, select from the set the optimized dimension subspace with the maximum Dunn index as the optimal dimension subspace.
The key steps of the method are explained in detail below, one by one:
Step one: load the data set. In this implementation of the method, the USDA Food Composition Data data set (18 dimensions, 722 sample points) is used as the experimental data set.
Step two: explore the dimension subspaces.
① Calculate the dimension point set of the experimental data set on a two-dimensional plane. First calculate the Pearson correlation coefficients between all dimensions of the original data set; then process the resulting Pearson correlation coefficients with the CMDS algorithm so that the dimension objects of the original data set are projected onto a two-dimensional plane. By the nature of the Pearson correlation coefficient, the more strongly two dimensions are correlated, the closer their dimension points lie on the two-dimensional plane.
② Select a target optimization dimension subspace. First apply K-means clustering to the dimension point set on the two-dimensional plane obtained in step 1.1) to obtain K clusters of dimension points; then take these K clusters as K candidate dimension subspaces and process the data objects of each candidate dimension subspace, one by one, with the CMDS algorithm; finally, calculate the Dunn index of the data points of each of the K candidate dimension subspaces on the two-dimensional plane, and select the candidate dimension subspace with the minimum Dunn index as the target optimization subspace, provided that its Dunn index is smaller than N (the threshold N is determined according to the requirements of the user); otherwise there is no target optimization subspace and the whole projection effect optimization process ends.
③ Select several dimension subspaces with good clustering structure. Using the Dunn indexes of the K candidate dimension subspaces calculated in step 1.2), first set a threshold W, which helps the user screen out the dimension subspaces whose clustering structure information is good enough for the user's needs (the larger the value of W, the stronger the clustering structure information of the selected subspaces); then select all candidate dimension subspaces with Dunn > W as the dimension subspaces with good clustering structure.
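The two threshold tests in ② and ③ amount to a small selection routine once the Dunn index of each candidate dimension subspace is known. The sketch below uses a hypothetical function name and made-up Dunn values; the thresholds N and W are user-chosen, as stated above.

```python
def split_subspaces(dunn_by_subspace, n_threshold, w_threshold):
    """Return (target subspace or None, list of well-clustered subspaces)."""
    # Target: the candidate with the smallest Dunn index, accepted only if Dunn < N.
    worst = min(dunn_by_subspace, key=dunn_by_subspace.get)
    target = worst if dunn_by_subspace[worst] < n_threshold else None
    # Well-clustered subspaces: every candidate with Dunn > W.
    good = [name for name, dunn in dunn_by_subspace.items() if dunn > w_threshold]
    return target, good


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the experiment.
    dunn = {"subspace 1": 0.9, "subspace 2": 0.8, "subspace 3": 0.1, "subspace 4": 0.2}
    print(split_subspaces(dunn, n_threshold=0.3, w_threshold=0.5))
```

A `None` target corresponds to the case in which the whole optimization process ends because no subspace falls below N.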
Step three: determine the target optimization dimension subspace set. As shown in FIG. 5-(a) and FIG. 5-(b), dimension subspace 3 and dimension subspace 4 are determined to be target optimization dimension subspaces and form a set. If no such subspace exists, the process ends.
Step four: determine the set of dimension subspaces with good clustering structure. As shown in FIG. 5-(a) and FIG. 5-(b), dimension subspace 1 and dimension subspace 2 are determined to be dimension subspaces with good clustering structure and form a set. If no such subspace exists, the process ends.
Step five: construct the reconstruction dimension set. To make use of the clustering information in the dimension subspaces with good clustering structure found in step four, the idea of dimension reconstruction is applied so that this clustering information is transferred into reconstruction dimensions. The specific steps are as follows:
① project the data objects of a dimension subspace with good clustering structure onto a two-dimensional plane with the CMDS algorithm;
② process the data point set on the two-dimensional plane with the LDA algorithm and construct the corresponding discrimination line;
③ project all data points on the two-dimensional plane onto the corresponding discrimination line, so that the set of projection values of all data points on each discrimination line forms a reconstruction dimension.
The result is shown in FIG. 5-(c): RD-A and RD-B denote the reconstruction dimensions generated from dimension subspace 1 and dimension subspace 2, respectively, and together they constitute the reconstruction dimension set.
Step six: construct the candidate optimization dimension subspace set. Take the target optimization dimension subspaces determined in step three and the reconstruction dimension set determined in step five, and form the candidate optimization dimension subspace set as the Cartesian product of the two.
Step seven: determine whether the candidate optimization dimension subspace set is empty, and verify iteratively.
① If the set is not empty:
⒈ Select a candidate optimized dimension subspace from the set and remove it from the set.
⒉ Perform projection verification on the candidate optimized subspace and observe by comparison whether the projection effect is improved; if so, jump to ⒊ for further verification, otherwise jump to step six and continue.
⒊ Calculate the Dunn value of the planar data points obtained from the projection verification in ⒉.
⒋ Determine whether a maximum Dunn value has been set. If it has, jump to ⒌; otherwise jump to ⒍.
⒌ Compare the Dunn value with the maximum Dunn value. If it is larger, jump to ⒍; otherwise jump to step six and continue.
⒍ Set the currently calculated Dunn value as the maximum Dunn value and set the current candidate optimized dimension subspace as the optimal subspace.
② If the set is empty, the whole projection optimization process ends.
Claims (6)
1. A high-dimensional data subspace clustering projection effect optimization method based on dimension reconstruction is characterized by comprising the following steps:
step 1): exploring the dimension subspace: selecting a target optimization dimension subspace with poor clustering structure information, namely poor two-dimensional projection effect, and a plurality of dimension subspaces with good clustering structure information, namely good projection effect, from an original data set;
step 2): constructing a reconstruction dimension set: projecting the data objects of the plurality of good clustering dimension subspaces obtained in the step 1) to a two-dimensional plane through a CMDS algorithm; then processing all data point sets on the two-dimensional plane through an LDA algorithm and constructing corresponding discrimination straight lines; finally, projecting all data points on a plurality of two-dimensional planes onto corresponding discrimination lines, wherein a set formed by projection values of all data points on each discrimination line forms a reconstruction dimension, and all reconstruction dimensions form a reconstruction dimension set;
step 3): constructing a candidate optimization dimension subspace set: according to the reconstruction dimension set obtained in step 2), freely combining one or more reconstruction dimensions from the set with the target optimization dimension subspace, all such combinations forming the candidate optimization dimension subspace set;
step 4): screening out an optimized dimension subspace set: for each candidate optimization dimension subspace in the set obtained in step 3), first projecting the data objects of the candidate optimization dimension subspace onto a two-dimensional plane through the CMDS algorithm; then applying a K-means clustering algorithm to all data points on the two-dimensional plane; finally, calculating the Dunn index of the clustering result, and if the Dunn index is larger than a preset threshold Q, determining the candidate optimization dimension subspace to be an optimized dimension subspace;
step 5): determining an optimal dimension subspace: according to the optimized dimension subspace set determined in step 4), if the set is empty, no optimal dimension subspace exists; otherwise, selecting the optimized dimension subspace with the maximum Dunn index from the set as the optimal dimension subspace.
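The claims use the Dunn index in steps 4) and 5) without defining it. The following is a minimal sketch, assuming the standard definition (smallest distance between points of different clusters divided by the largest cluster diameter). The function name `dunn_index` and the input layout are illustrative and do not come from the patent.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn index of a clustering given as a list of clusters of points.

    Assumed definition: minimum distance between points of different
    clusters, divided by the maximum distance between two points of
    the same cluster (the largest cluster diameter).
    """
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    # Smallest separation between any two points of different clusters.
    min_between = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest diameter over all clusters (0.0 if every cluster is a single point).
    max_diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        return float("inf")
    return min_between / max_diameter
```

A larger value indicates compact, well-separated clusters, which is why the claims compare it against the threshold Q.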
2. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 1) comprises the following steps:
step 1.1): calculating a set of dimension points of a two-dimensional plane of the original dataset: firstly, calculating Pearson correlation coefficients of all dimensions of an original data set; then processing the obtained Pearson correlation coefficient through a CMDS algorithm to enable a dimension object of the original data set to be projected to a two-dimensional plane;
step 1.2): selecting a target optimization dimension subspace: applying a K-means clustering algorithm to the set of dimension points on the two-dimensional plane obtained in step 1.1) to obtain K clusters of dimension points; then taking the K clusters of dimension points as K candidate dimension subspaces, and processing the data objects of the candidate dimension subspaces one by one through the CMDS algorithm; finally, calculating the Dunn indexes of the data points of the K candidate dimension subspaces on the two-dimensional plane, and selecting the candidate dimension subspace with the minimum Dunn index as the target optimization subspace, provided that Dunn < N; otherwise, there is no target optimization subspace, and the whole projection effect optimization process ends directly;
step 1.3): selecting a plurality of dimension subspaces with good clustering structure: according to the Dunn indexes of the K candidate dimension subspaces calculated in step 1.2), first setting a dimension subspace threshold W used to screen out the dimension subspaces whose clustering structure information is good enough to meet the user requirement, wherein the larger the value of W, the stronger the clustering structure information of the screened dimension subspaces; then selecting all candidate dimension subspaces with Dunn > W as the dimension subspaces with good clustering structure.
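Step 1.1) can be sketched with NumPy as follows. Classical MDS (CMDS) is implemented by double centering and eigendecomposition. The conversion of Pearson correlations into dissimilarities as `1 - |r|` is an assumption of this sketch, since the claims do not state it, and the function names are illustrative.

```python
import numpy as np


def classical_mds(dissimilarity, n_components=2):
    """Classical MDS: embed items so that pairwise distances approximate
    the given symmetric dissimilarity matrix."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered Gram matrix B = -1/2 * J * D^2 * J.
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the largest eigenvalues; clip small negatives caused by rounding.
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def embed_dimensions(data):
    """Project the dimensions (columns) of `data` onto a 2-D plane.

    Assumption: the dissimilarity between two dimensions is 1 - |r|,
    where r is their Pearson correlation coefficient.
    """
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    return classical_mds(1.0 - np.abs(corr))
```

Each row of the result is one dimension of the original data set, placed so that strongly correlated dimensions lie close together. These are the points that step 1.2) then clusters with K-means.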
3. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 2) comprises the following steps:
step 2.1): selecting a dimension subspace with a good clustering structure: selecting one dimension subspace that has not yet been selected from the plurality of dimension subspaces with good clustering structure obtained in step 1.3);
step 2.2): calculating two-dimensional space data points: for the dimension subspace with good clustering structure selected in step 2.1), projecting all data objects in the dimension subspace onto a two-dimensional plane through CMDS algorithm processing and normalization;
step 2.3): calculating a discrimination line: processing the two-dimensional space data points calculated in step 2.2) through the LDA algorithm to obtain the discrimination line corresponding to the data points of the dimension subspace;
step 2.4): calculating a projection value: according to the two-dimensional plane data points calculated in step 2.2) and the discrimination line calculated in step 2.3), projecting the two-dimensional plane data points one by one onto each discrimination line and calculating the projection value according to projection formula (1); since the x and y values of the two-dimensional plane data points obtained in step 2.2) lie in the range [-1, 1], the discrimination line can be represented by the segment of the line cut off by a square centered at (0,0), with side length 2 and with each side either perpendicular or parallel to a coordinate axis; in equation (1), O represents the midpoint of the discrimination line segment, M represents any data point on the two-dimensional plane, and M' represents the projection point of M on the discrimination line, so that $M_{rd}$ is the projection value of the data point M on the discrimination line;
step 2.5): standardizing the projection values: the projection value sequence of each discrimination line obtained in step 2.4) is standardized by the following formula:
$x'_{id} = \dfrac{x_{id} - \mu_d}{\sigma_d}$
where $x_{id}$ represents a single projection value, $\mu_d$ represents the mean of the projection value list, $\sigma_d$ represents the standard deviation of the projection value list, and $x'_{id}$ is the standardized projection value;
step 2.6): constructing a reconstruction dimension: taking the standardized projection values of the two-dimensional plane data points on each discrimination line calculated in step 2.5), ordering the projection value list of each discrimination line according to the original order of the two-dimensional space data points, the resulting sequence being a reconstruction dimension; then jumping to step 2.1) until all dimension subspaces with good clustering structure have been processed.
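Steps 2.3) to 2.6) can be illustrated with a small pure-Python sketch. It rests on several assumptions that the claims leave open: the points carry two class labels (0 and 1, for example from a prior clustering), LDA is the two-class Fisher discriminant, the discrimination line passes through the origin so that its midpoint O is (0,0), the projection value is the signed scalar projection onto the line, and the standard deviation is the population one. All function names are illustrative.

```python
from statistics import mean, pstdev


def fisher_direction(points, labels):
    """Unit vector along the two-class Fisher discriminant line in 2-D."""
    groups = [[p for p, l in zip(points, labels) if l == k] for k in (0, 1)]
    means = [(mean([x for x, _ in g]), mean([y for _, y in g])) for g in groups]
    # Entries of the symmetric 2x2 within-class scatter matrix Sw.
    sxx = sxy = syy = 0.0
    for g, (mx, my) in zip(groups, means):
        for x, y in g:
            dx, dy = x - mx, y - my
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    # w = Sw^{-1} (mean_1 - mean_0), using the closed-form 2x2 inverse.
    dmx = means[1][0] - means[0][0]
    dmy = means[1][1] - means[0][1]
    wx = (syy * dmx - sxy * dmy) / det
    wy = (sxx * dmy - sxy * dmx) / det
    norm = (wx * wx + wy * wy) ** 0.5
    if norm == 0:
        raise ValueError("class means coincide")
    return wx / norm, wy / norm


def reconstruction_dimension(points, labels):
    """Standardized projection values, kept in the original point order."""
    wx, wy = fisher_direction(points, labels)
    # Signed scalar projection of each point onto the line through the origin.
    raw = [x * wx + y * wy for x, y in points]
    mu, sigma = mean(raw), pstdev(raw)
    return [(v - mu) / sigma for v in raw]
```

Because the list is built by iterating over `points` in order, the i-th entry of the result belongs to the i-th data object, as step 2.6) requires for the sequence to act as a new dimension.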
4. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 3) comprises the following steps:
step 3.1): according to the reconstruction dimension set determined in step 2), calculating the set $C_{rd}$ of all non-empty subsets of that set;
step 3.2): forming the set $D_{od}$ from the target optimization dimension subspace determined in step 1);
step 3.3): calculating the Cartesian product $C_{rd} \times D_{od}$, the resulting set being the candidate optimization subspace set $K_{cos}$.
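The construction in steps 3.1) to 3.3) can be sketched with the standard library. The function names and the representation of a candidate subspace as a tuple of dimension names (target dimensions followed by the chosen reconstruction dimensions) are assumptions of this sketch.

```python
from itertools import combinations, product


def nonempty_subsets(items):
    """All non-empty subsets of `items`, as tuples (the set C_rd)."""
    items = list(items)
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


def candidate_subspaces(reconstruction_dims, target_subspace):
    """Candidate set K_cos built from the Cartesian product C_rd x D_od."""
    c_rd = nonempty_subsets(reconstruction_dims)
    d_od = [tuple(target_subspace)]  # a single target optimization subspace
    # Each pair (subset, target) is flattened into one tuple of dimensions.
    return [target + extra for extra, target in product(c_rd, d_od)]
```

With n reconstruction dimensions there are 2^n - 1 candidates, so the number of subspaces to screen in step 4) grows exponentially with n.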
5. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 4) comprises the following steps:
step 4.1): letting the optimized dimension subspace set $K_{os} = \varnothing$, and setting a threshold Q;
step 4.2): taking the candidate optimized dimension subspace set $K_{cos}$ calculated in step 3.3): if $K_{cos} \neq \varnothing$, selecting a candidate optimized dimension subspace $s_{co}$, where $s_{co} \in K_{cos}$, then removing $s_{co}$ from $K_{cos}$, i.e. $K_{cos} = K_{cos} - s_{co}$, and jumping to step 4.3) to continue; if $K_{cos} = \varnothing$, the calculation of the optimized dimension subspace set $K_{os}$ is finished, and jumping to step 5) to continue;
step 4.3): processing the candidate optimized dimension subspace $s_{co}$ selected in step 4.2) through the MDS algorithm to obtain a scatter diagram on a two-dimensional plane, then applying a K-means clustering algorithm to the data points of the scatter diagram and calculating the Dunn index of the clustering result; if the Dunn index is larger than the threshold Q, adding the candidate optimized dimension subspace $s_{co}$ to the optimized dimension subspace set $K_{os}$, i.e. $K_{os} = K_{os} + s_{co}$; then jumping to step 4.2) to continue.
6. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 5) comprises the following steps:
from the optimized dimension subspace set $K_{os}$ calculated in step 4), selecting the optimized dimension subspace with the largest corresponding Dunn index as the optimal dimension subspace $s_{mo}$.
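The screening loop of claim 5 and the selection of claim 6 can be sketched as follows. The callable `dunn_of` is a hypothetical stand-in for "project with MDS, cluster with K-means, compute the Dunn index"; it and the function names are not from the patent, and candidates are assumed to be hashable (for example tuples of dimension names).

```python
def screen_candidates(candidates, dunn_of, q):
    """Return {subspace: Dunn index} for candidates whose index exceeds q."""
    remaining = list(candidates)  # K_cos
    optimized = {}                # K_os, with the Dunn index of each member
    while remaining:              # step 4.2): stop when K_cos is empty
        s_co = remaining.pop()    # select and remove one candidate
        score = dunn_of(s_co)     # step 4.3): hypothetical scoring callable
        if score > q:
            optimized[s_co] = score
    return optimized


def best_subspace(optimized):
    """Optimal subspace s_mo: largest Dunn index, or None if K_os is empty."""
    if not optimized:
        return None
    return max(optimized, key=optimized.get)
```

Keeping the Dunn index next to each accepted subspace avoids recomputing the projection and clustering when the maximum is taken in step 5).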
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510504284.2A CN105160352A (en) | 2015-08-18 | 2015-08-18 | High-dimensional data subspace clustering projection effect optimization method based on dimension reconstitution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105160352A (en) | 2015-12-16 |
Family
ID=54801204
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510504284.2A Pending CN105160352A (en) | 2015-08-18 | 2015-08-18 | High-dimensional data subspace clustering projection effect optimization method based on dimension reconstitution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105160352A (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105631805B (en) * | 2016-03-05 | 2018-09-04 | 陈晋飞 | A kind of production method of higher-dimension vision |
CN105631805A (en) * | 2016-03-05 | 2016-06-01 | 陈晋飞 | High-dimensional vision generation method |
CN107330452A (en) * | 2017-06-16 | 2017-11-07 | 悦享趋势科技(北京)有限责任公司 | Clustering method and device |
CN108021664A (en) * | 2017-12-04 | 2018-05-11 | 北京工商大学 | A kind of multidimensional data correlation visual analysis method and system based on dimensional projections |
CN108021664B (en) * | 2017-12-04 | 2020-05-05 | 北京工商大学 | Multidimensional data correlation visual analysis method and system based on dimension projection |
CN109344194B (en) * | 2018-09-20 | 2021-09-28 | 北京工商大学 | Subspace clustering-based pesticide residue high-dimensional data visual analysis method and system |
CN109344194A (en) * | 2018-09-20 | 2019-02-15 | 北京工商大学 | Pesticide residue high dimensional data visual analysis method and system based on subspace clustering |
CN109903852A (en) * | 2019-01-18 | 2019-06-18 | 杭州电子科技大学 | Based on the customized intelligent Epileptic Prediction of PCA-LDA |
CN110188098A (en) * | 2019-04-26 | 2019-08-30 | 浙江大学 | A kind of high dimension vector data visualization method and system based on the double-deck anchor point figure projection optimization |
CN110188098B (en) * | 2019-04-26 | 2021-02-19 | 浙江大学 | High-dimensional vector data visualization method and system based on double-layer anchor point map projection optimization |
CN111950651A (en) * | 2020-08-21 | 2020-11-17 | 中国科学院计算机网络信息中心 | High-dimensional data processing method and device |
CN111950651B (en) * | 2020-08-21 | 2024-02-09 | 中国科学院计算机网络信息中心 | High-dimensional data processing method and device |
CN116049648A (en) * | 2022-11-17 | 2023-05-02 | 北京东方通科技股份有限公司 | Multiparty projection method and multiparty data analysis method based on data security |
CN116049648B (en) * | 2022-11-17 | 2023-08-04 | 北京东方通科技股份有限公司 | Multiparty projection method and multiparty data analysis method based on data security |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20151216 |