CN105160352A

CN105160352A - High-dimensional data subspace clustering projection effect optimization method based on dimension reconstitution

Info

Publication number: CN105160352A
Application number: CN201510504284.2A
Authority: CN
Inventors: 周芳芳; 李俊材; 黄伟; 赵颖
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2015-08-18
Filing date: 2015-08-18
Publication date: 2015-12-16

Abstract

The present invention provides a high-dimensional data subspace clustering projection effect optimization method based on dimension reconstitution. The high-dimensional data subspace clustering projection effect optimization method based on dimension reconstitution comprises the specific steps of: 1, searching a dimension subspace, i.e. confirming a target optimization subspace of which a two-dimensional projection effect needs to be improved and selecting a subspace with an excellent clustering structure; 2, constructing a reconstructed dimension set, i.e. transferring clustering information of the subspace with the excellent clustering structure to a reconstructed dimension; 3, constructing candidate optimal dimension subspace sets, i.e. carrying out free combination on each element of the reconstructed dimension set and the target optimization dimension subspace to generate the candidate optimal subspace sets; 4, screening an optimal dimension subspace set; 5, determining an optimal dimension subspace. According to the high-dimensional data subspace clustering projection effect optimization method based on dimension reconstitution, the reconstruction concept is creatively introduced into the high-dimensional data subspace, the clustering projection effect of the target optimization subspace is improved and enhanced by the reconstructed dimension with stronger clustering information, and the problem of distortion of the clustering projection effect of the high-dimensional data subspace on the two-dimensional plane is solved.

Description

High-dimensional data subspace clustering projection effect optimization method based on dimension reconstruction

Technical Field

The invention relates to the technical field of high-dimensional data analysis, processing and visualization, in particular to a method for optimizing subspace clustering projection effect by using relevant concepts and technologies such as subspace clustering, LDA (linear discriminant analysis), MDS (multidimensional scaling) and Dunn indexes.

Background

With the rapid adoption of computer technology across industries, the volume of data keeps growing, and much of it is multidimensional or even high-dimensional. Because human cognition is limited and people cannot readily picture a high-dimensional data space, it remains difficult to extract the deep information embedded in complex high-dimensional data. The ideal result of a clustering method is that objects belonging to the same cluster are as similar as possible, while objects belonging to different clusters are as dissimilar as possible. Through cluster analysis, people can discover knowledge hidden in data even when the regularities in a large amount of data are unclear. When the dimensionality of the data is high, however, the noise level rises, the data become sparser, and distance-based measures lose their effectiveness; such adverse effects, which grow with increasing dimensionality, are called the "curse of dimensionality". How to process high-dimensional data has therefore attracted wide attention and become an active research problem.

Visualization technology is a visual perception technology that helps us understand data. The Oxford English Dictionary defines visualization as "the ability or process to compose a mental context, or the vision of something that is not directly perceptible". The term also refers to the process of generating a visible image of what would otherwise be invisible. Visualization has also been described as a series of transformations that convert raw analog data into a displayable image, the purpose of which is to convert the information into a format that can be perceived by the human sensory system. Visualization technology is now applied in nearly all fields of scientific research, and is an emerging discipline that spans computer graphics, signal processing, human-computer interaction, artificial intelligence and other fields. Applying the theory and methods of visualization to pattern recognition allows human flexibility and creativity to be brought into full play. Visualization can serve as an intermediary between abstract data and the user, providing the user with an overall view of the data and helping the user identify content of interest.

A subspace is a set formed from a subset of the dimensions of the original high-dimensional data set; the dimensions of different subspaces may partially overlap or be completely different. Clustering is a common data analysis tool that aims to divide a large collection of data points into several classes, so that the data within each class are maximally similar, while the data in different classes are maximally different. The data to be clustered often have very high dimensionality, reaching hundreds or even thousands of dimensions, and clustering in such a high-dimensional space is a challenging problem. There are three main reasons: 1) clustering is in essence an unsupervised learning problem, so many supervised learning algorithms cannot be used; 2) in such a high-dimensional space, the distance between instances is dominated by a large number of irrelevant attributes, which may cause instances with close values of the relevant attributes to be far apart, so that the clustering result is not ideal; 3) the higher the dimensionality, the greater the computational cost, owing to the curse of dimensionality. To address these problems, a dimension reduction method is usually adopted to map the high-dimensional data to a low-dimensional space for subspace selection, and clustering is then performed on the subspace. Commonly used dimensionality reduction methods include Principal Component Analysis (PCA), Multidimensional Scaling (MDS), Locally Linear Embedding (LLE), Linear Discriminant Analysis (LDA), and the like. The invention uses two dimensionality reduction technologies, MDS and LDA, to process the high-dimensional data set in different combinations.

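The Dunn separability index is a clustering validity function based on geometric structure. Dunn introduced this hard clustering validity function by using the compactness and separability of a data set:

$$D = \min_{i}\ \min_{j \ne i} \left\{ \frac{d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)} \right\}$$

wherein $d(C_i, C_j) = \min_{x \in C_i,\ y \in C_j} d(x, y)$ defines the distance between class $C_i$ and class $C_j$, and $\operatorname{diam}(C_k) = \max_{x,\ y \in C_k} d(x, y)$ defines the diameter of cluster $C_k$. It is clear that a large Dunn index value indicates that the data set contains clusters with good compactness and separation.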
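As an illustration of the index described above, the following is a minimal sketch that assumes Euclidean distance, the minimum point-to-point distance as d(C_i, C_j), and the maximum intra-cluster distance as diam(C_k). The function name `dunn_index` and the sample points are hypothetical and are not taken from the invention.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Dunn index: smallest inter-cluster distance over largest cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # diam(C_k): largest pairwise distance inside one cluster (0 for singletons)
    max_diameter = max(
        max((dist(a, b) for a, b in combinations(members, 2)), default=0.0)
        for members in clusters.values()
    )
    # d(C_i, C_j): smallest distance between points of two different clusters
    min_separation = min(
        dist(a, b)
        for ci, cj in combinations(list(clusters.values()), 2)
        for a in ci
        for b in cj
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


# Two tight, well separated clusters give a large index.
sample_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
sample_labels = [0, 0, 1, 1]
print(dunn_index(sample_points, sample_labels))
```

In this sample the smallest distance between the two clusters is 5 and the largest diameter is 1, so the ratio is large, which matches the reading that a large value indicates compact and well separated clusters.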

Disclosure of Invention

The main objective of the invention is to optimize the high dimensional data subspace (dimension subspace) clustering projection effect. Aiming at the condition that the projection effect of the high-dimensional data subspace is not ideal, the invention provides a method capable of improving the projection effect.

The design idea of the invention is as follows: based on the idea of dimension reconstruction, by means of two dimension reduction technologies of MDS and LDA, necessary clustering information is collected from an original high-dimensional data subspace, the clustering information is constructed into a new dimension to be introduced into the original high-dimensional data subspace, an optimization subspace carrying stronger clustering information is formed, and then the clustering projection effect of the optimization subspace is better than that of the original subspace.

In order to achieve the technical purpose, the technical scheme of the invention is that,

a high-dimensional data subspace clustering projection effect optimization method based on dimension reconstruction comprises the following steps:

step 1): exploring the dimension subspace: selecting a target optimization dimension subspace with poor clustering structure information, namely poor two-dimensional projection effect, and a plurality of dimension subspaces with good clustering structure information, namely good projection effect, from an original data set;

step 2): constructing a reconstruction dimension set: projecting the data objects of a plurality of dimension subspaces with good clustering structures obtained in the step 1) to a two-dimensional plane through a CMDS algorithm; then processing all data point sets on the two-dimensional plane through an LDA algorithm and constructing corresponding discrimination straight lines; finally, projecting all data points on a plurality of two-dimensional planes onto corresponding discrimination lines, wherein a set formed by projection values of all data points on each discrimination line forms a reconstruction dimension, and all reconstruction dimensions form a reconstruction dimension set;

step 3): constructing a candidate optimization dimension subspace set: selecting one or more reconstruction dimensions in the reconstruction dimension set and a target optimization dimension subspace to be freely combined according to the reconstruction dimension set obtained in the step 2), and forming a candidate optimization dimension subspace set by all combinations;

step 4): screening out an optimized dimension subspace set: according to the candidate optimization dimension subspace set obtained in the step 3), for each candidate optimization dimension subspace in the set, firstly, projecting a data object in the candidate optimization dimension subspace to a two-dimensional plane through CMDS algorithm processing; then applying a K mean value clustering algorithm to all data points on the two-dimensional plane; finally, calculating a Dunn index of a clustering result, and if the Dunn index is larger than a certain preset threshold Q, determining the candidate optimized dimension subspace as an optimized dimension subspace;

step 5): determining an optimal dimension subspace: according to the optimized dimension subspace set determined in the step 4), if the set is empty, no optimal dimension subspace exists; otherwise, selecting the optimal dimension subspace corresponding to the maximum Dunn index from the set as the optimal dimension subspace.
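The scoring used in step 4) (CMDS projection, K-means clustering, Dunn index) can be sketched as follows. This is only an illustration under stated assumptions: Euclidean distances between data objects, a plain NumPy implementation of classical MDS, and a simple K-means with farthest-point initialisation. The names `classical_mds`, `kmeans_labels`, `dunn_index` and `score_subspace` are hypothetical helpers, not part of the invention.

```python
import numpy as np


def classical_mds(x, k=2):
    """Classical MDS (CMDS) of the rows of x into k dimensions."""
    x = np.asarray(x, dtype=float)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)  # squared distances
    n = len(x)
    j = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
    b = -0.5 * j @ d2 @ j                                    # double centering
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:k]                         # k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))


def kmeans_labels(points, k, iters=100):
    """Plain K-means; ASSUMPTION: deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        updated = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels


def dunn_index(points, labels):
    """Smallest inter-cluster distance divided by largest cluster diameter."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    max_diameter = d[same].max()
    return np.inf if max_diameter == 0 else d[~same].min() / max_diameter


def score_subspace(data, columns, k):
    """Project the chosen columns to 2-D with CMDS, cluster, return the Dunn index."""
    plane = classical_mds(np.asarray(data, dtype=float)[:, columns], k=2)
    return dunn_index(plane, kmeans_labels(plane, k))


data = np.array([[0, 0, 0, 0], [0.1, 0, 0, 0], [0, 0.1, 0, 0],
                 [5, 5, 5, 5], [5.1, 5, 5, 5], [5, 5.1, 5, 5]], dtype=float)
print(score_subspace(data, [0, 1, 2, 3], k=2))
```

A candidate subspace would then be kept in step 4) when this score exceeds the preset threshold Q; the number of clusters k is left to the user here because the text does not fix it.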

The high-dimensional data subspace clustering projection effect optimization method based on the dimension reconstruction comprises the following steps in the step 1):

step 1.1): calculating a set of dimension points of a two-dimensional plane of the original dataset: firstly, calculating Pearson correlation coefficients of all dimensions of an original data set; then processing the obtained Pearson correlation coefficient through a CMDS algorithm to enable a dimension object of the original data set to be projected to a two-dimensional plane;

step 1.2): selecting a target optimization dimension subspace: processing the dimension point set of the original data set on the two-dimensional plane obtained in the step 1.1) through a K-means clustering algorithm to obtain K clusters of dimension points; then taking the K clusters of dimension points as K candidate dimension subspaces, and processing the data objects of each candidate dimension subspace one by one through the CMDS algorithm; finally, calculating the Dunn indexes of the data points of the K candidate dimension subspaces on the two-dimensional plane, and selecting the candidate dimension subspace with the minimum Dunn index, provided that Dunn < N, as the target optimization subspace; otherwise, no target optimization subspace exists and the whole projection effect optimization process ends directly;

step 1.3): selecting a plurality of dimension subspaces with good clustering structures: according to the Dunn indexes of K candidate dimension subspaces obtained by calculation in the step 1.2), firstly setting a dimension subspace threshold value W for screening out dimension subspace information which meets the user requirement and has good clustering structure information, wherein the larger the value of W is, the stronger the screened dimension subspace clustering structure information is; all candidate dimension subspaces of Dunn > W are then selected as the dimension subspace with good cluster structure.
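Step 1.1) can be sketched as below. The text does not say how the Pearson coefficients are turned into the dissimilarities that CMDS needs, so this sketch assumes the dissimilarity 1 - |r|, which places strongly correlated dimensions close together. The function names and the sample data are hypothetical.

```python
import numpy as np


def classical_mds_from_distances(dist, k=2):
    """Classical MDS (CMDS) from a symmetric distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    b = -0.5 * j @ d2 @ j                          # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:k]               # k largest eigenvalues
    # 1 - |r| is not guaranteed to be Euclidean, so negative eigenvalues are clipped
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))


def embed_dimensions(data):
    """Map every dimension (column) of the data set to a point on a 2-D plane."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)  # Pearson r
    dist = 1.0 - np.abs(corr)      # ASSUMPTION: dissimilarity between dimensions
    np.fill_diagonal(dist, 0.0)
    return classical_mds_from_distances(dist, k=2)


# Columns 0 and 1 are perfectly correlated; column 2 is nearly uncorrelated.
a = np.arange(10.0)
b = np.array([1.0, -1.0] * 5)
data = np.column_stack([a, 2 * a + 1, b])
dimension_points = embed_dimensions(data)
print(np.round(dimension_points, 3))
```

The two perfectly correlated columns land on (almost) the same point, while the third lies far from both, so a K-means pass over these dimension points as in step 1.2) would group the first two dimensions into one candidate subspace.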

The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 2) comprises the following steps:

step 2.1): selecting a dimension subspace with a good clustering structure, and selecting an unselected dimension subspace from the plurality of dimension subspaces with good clustering obtained in the step 1.3);

step 2.2): calculating two-dimensional space data points, projecting all data objects in the dimensional subspace to a two-dimensional plane after the data objects are subjected to CMDS algorithm and normalization processing according to the dimensional subspace with the good clustering structure selected in the step 2.1);

step 2.3): calculating a discrimination straight line, and obtaining the discrimination straight line corresponding to the data point in the dimensionality subspace after processing the two-dimensional space data point calculated in the step 2.2) through an LDA algorithm;

step 2.4): calculating a projection value, and according to the two-dimensional plane data point calculated in the step 2.2) and the discrimination straight line calculated in the step 2.3), calculating the projection value according to the following projection formula:

projecting the two-dimensional plane data points one by one onto each discrimination line; since the x and y values of the two-dimensional plane data points obtained in the step 2.2) lie in the range [-1, 1], the discrimination line can be represented by the segment of the line cut off by a square centered at (0,0), with a side length of 2 and with each side either perpendicular or parallel to a coordinate axis; in equation (1), O represents the midpoint of the discrimination line segment, M represents any one data point on the two-dimensional plane, M' represents the projection point of M on the discrimination line, and M_rd is the projection value of the data point M on the discrimination line;

step 2.5): normalizing the projection values, wherein the projection value sequence of each discrimination line obtained in the step 2.4) is standardized by the following formula:

$$x'_{id} = \frac{x_{id} - \mu_d}{\sigma_d}$$

wherein $x_{id}$ represents a single projection value, $\mu_d$ represents the mean of the list of projection values, and $\sigma_d$ represents the standard deviation of the list of projection values;

step 2.6): constructing a reconstruction dimension: according to the standardized projection values of the two-dimensional plane data points on each discrimination line calculated in the step 2.5), ordering the list of projection values corresponding to each discrimination line according to the original order of the two-dimensional space data points, the resulting sequence being the reconstruction dimension; then jumping to the step 2.1) until all the dimension subspaces with good clustering structures have been processed.
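Steps 2.2) to 2.6) can be sketched for one subspace as follows. Several points are assumptions rather than statements of the method: the class labels passed to LDA are supplied by the caller (the text does not say where they come from; a K-means result is one plausible source), normalisation into [-1, 1] is done by dividing by the largest absolute coordinate, the discrimination line is taken through the origin along the Fisher direction, the projection value is the signed position along that line, and the population standard deviation is used. All names are hypothetical.

```python
import numpy as np


def lda_direction(points, labels):
    """Unit vector of the Fisher discriminant direction for 2-D points."""
    x = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    overall_mean = x.mean(axis=0)
    s_within = np.zeros((2, 2))
    s_between = np.zeros((2, 2))
    for label in np.unique(labels):
        members = x[labels == label]
        mean = members.mean(axis=0)
        s_within += (members - mean).T @ (members - mean)
        diff = (mean - overall_mean)[:, None]
        s_between += len(members) * (diff @ diff.T)
    vals, vecs = np.linalg.eig(np.linalg.pinv(s_within) @ s_between)
    w = np.real(vecs[:, np.argmax(np.real(vals))])   # leading eigenvector
    return w / np.linalg.norm(w)


def reconstruction_dimension(points_2d, labels):
    """Standardised projection values of 2-D points on the discrimination line."""
    pts = np.asarray(points_2d, dtype=float)
    pts = pts / np.abs(pts).max()        # ASSUMPTION: scale x and y into [-1, 1]
    w = lda_direction(pts, labels)
    projection = pts @ w                 # ASSUMPTION: signed position along the line
    # z-score of step 2.5); the original order of the data points is preserved
    return (projection - projection.mean()) / projection.std()


plane_points = [(-2, 0), (-2, 1), (-3, 0.5), (2, 0), (2, 1), (3, 0.5)]
plane_labels = [0, 0, 0, 1, 1, 1]
print(np.round(reconstruction_dimension(plane_points, plane_labels), 3))
```

Because the values are standardised afterwards, shifting the reference point O along the line only adds a constant to every signed position and does not change the result of this sketch. The sign of the direction is arbitrary, so the two clusters may appear on either side of zero.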

The high-dimensional data subspace clustering projection effect optimization method based on the dimension reconstruction comprises the following steps in the step 3):

step 3.1): according to the reconstruction dimension set determined in the step 2), calculating the set C_rd of all non-empty subsets of that set;

step 3.2): according to the target optimization dimension subspaces determined in the step 1), forming the set D_od of target optimization dimension subspaces;

step 3.3): calculating the Cartesian product C_rd × D_od of the two sets, the resulting set being the candidate optimization subspace set K_cos.
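The construction in steps 3.1) to 3.3) can be sketched with the standard library. The dimension names (`d1`, `RD-A`, and so on) and the function names are hypothetical; each candidate is represented here simply as the tuple of dimensions of a target subspace followed by the chosen reconstruction dimensions.

```python
from itertools import combinations, product


def non_empty_subsets(items):
    """All non-empty subsets of items, as tuples (the set C_rd)."""
    items = list(items)
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


def candidate_subspaces(reconstruction_dims, target_subspaces):
    """Cartesian product C_rd x D_od, flattened into lists of dimension names."""
    c_rd = non_empty_subsets(reconstruction_dims)
    d_od = [tuple(target) for target in target_subspaces]
    return [target + extra for extra, target in product(c_rd, d_od)]


reconstruction_dims = ["RD-A", "RD-B"]
target_subspaces = [("d1", "d2"), ("d3", "d4", "d5")]
k_cos = candidate_subspaces(reconstruction_dims, target_subspaces)
for candidate in k_cos:
    print(candidate)
```

With n reconstruction dimensions there are 2^n - 1 non-empty subsets, so the candidate set grows exponentially in n; two reconstruction dimensions and two target subspaces give 3 × 2 = 6 candidates.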

The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 4) comprises the following steps:

step 4.1): letting the optimized dimension subspace set K_os = ∅, and setting a threshold Q;

step 4.2): according to the candidate optimized dimension subspace set K_cos calculated in the step 3.3), if K_cos ≠ ∅, selecting a candidate optimized dimension subspace s_co, wherein s_co ∈ K_cos, then removing the candidate optimized dimension subspace s_co from the candidate optimized dimension subspace set K_cos, i.e. K_cos = K_cos - s_co, and jumping to the step 4.3) to continue execution; if K_cos = ∅, the calculation of the optimized dimension subspace set K_os is finished, and jumping to the step 5) to continue execution;

step 4.3): processing the candidate optimized dimension subspace s_co selected in the step 4.2) through the MDS algorithm to obtain a scatter diagram on a two-dimensional plane, then applying a K-means clustering algorithm to the data points of the scatter diagram and calculating the Dunn index of the clustering; if the Dunn index is larger than the threshold Q, adding the candidate optimized dimension subspace s_co to the optimized dimension subspace set K_os, i.e. K_os = K_os + s_co; then jumping to the step 4.2) to continue execution.

The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 5) comprises the following steps:

from the optimized dimension subspace set K_os calculated in the step 4), selecting the optimized dimension subspace with the largest corresponding Dunn index as the optimal dimension subspace s_mo.
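The control flow of step 4) and step 5) can be sketched as a simple loop. The scoring function is passed in as a parameter, since it stands for the MDS projection, K-means clustering and Dunn index computation of step 4.3); `screen_and_pick`, `dunn_of` and the sample scores are hypothetical.

```python
def screen_and_pick(k_cos, dunn_of, q):
    """Return (s_mo, k_os): the optimal subspace and the screened set.

    k_cos   -- iterable of candidate optimized dimension subspaces
    dunn_of -- callable giving the Dunn index of a candidate's 2-D clustering
    q       -- preset threshold Q
    """
    k_os = []                        # step 4.1): start from an empty set
    remaining = list(k_cos)
    while remaining:                 # step 4.2): stop when K_cos is empty
        s_co = remaining.pop()       # select a candidate and remove it from K_cos
        dunn = dunn_of(s_co)         # step 4.3): score the candidate
        if dunn > q:
            k_os.append((s_co, dunn))
    if not k_os:                     # step 5): empty set, no optimal subspace
        return None, k_os
    s_mo = max(k_os, key=lambda pair: pair[1])[0]
    return s_mo, k_os


# Hypothetical precomputed Dunn indexes standing in for the real scoring.
example_scores = {"s1": 0.4, "s2": 0.9, "s3": 0.7}
best, kept = screen_and_pick(example_scores, example_scores.get, q=0.5)
print(best, sorted(name for name, _ in kept))
```

Keeping the Dunn index next to each screened subspace avoids recomputing the projection and clustering when the maximum is selected in step 5).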

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention;

FIG. 2 is a diagram of a specific implementation of the method of the present invention;

FIG. 3 is a conceptual illustration of the method of the present invention;

FIG. 4 is a schematic diagram of a dimension reconstruction method according to the method of the present invention;

FIG. 5 is a software tool schematic of the method of the present invention.

Detailed Description

In order that the objects, design considerations and advantages of the present invention will become more apparent, the invention will be further described in detail in the following section, taken in conjunction with the accompanying drawings.

The invention provides a high-dimensional data subspace clustering projection effect optimization method based on dimension reconstruction, which comprises the following five main steps as shown in FIG. 1:

step 1): exploring the dimension subspace: selecting a target optimization dimension subspace with poor clustering structure information, namely poor two-dimensional projection effect, and a plurality of dimension subspaces with good clustering structure information, namely good projection effect, from an original data set;

step 2): constructing a reconstruction dimension set: projecting the data objects of the plurality of good clustering dimension subspaces obtained in the step 1) to a two-dimensional plane through a CMDS algorithm; then processing all data point sets on the two-dimensional plane through an LDA algorithm and constructing corresponding discrimination straight lines; finally, projecting all data points on a plurality of two-dimensional planes onto corresponding discrimination lines, wherein a set formed by projection values of all data points on each discrimination line forms a reconstruction dimension, and all reconstruction dimensions form a reconstruction dimension set;

step 3): constructing a candidate optimization dimension subspace set: selecting one or more reconstruction dimensions in the reconstruction dimension set and a target optimization dimension subspace to be freely combined according to the reconstruction dimension set obtained in the step 2), and forming a candidate optimization dimension subspace set by all combinations;

step 4): screening out an optimized dimension subspace set: according to the candidate optimization dimension subspace set obtained in the step 3), for each candidate optimization dimension subspace in the set, firstly, projecting a data object in the candidate optimization dimension subspace to a two-dimensional plane through CMDS algorithm processing; then applying a K-means clustering algorithm to all data points on the two-dimensional plane; finally, calculating a Dunn index of a clustering result, and if the Dunn index is larger than a certain preset threshold Q (the threshold Q is determined according to the requirements of users), determining the candidate optimized dimension subspace as an optimized dimension subspace;

step 5): determining an optimal dimension subspace: according to the optimized dimension subspace set determined in the step 4), if the set is empty, no optimal dimension subspace exists; otherwise, selecting the optimized dimension subspace corresponding to the maximum Dunn index from the set as the optimal dimension subspace.

The key steps involved in the method of the invention are explained in detail one by one, and the specific steps are as follows:

Step one, loading a data set. In the implementation of the method, the USDAFoodComposionData data set (consisting of 18 dimensions and 722 sample points) is selected as the experimental data set.

Step two, exploring the dimension subspace.

① Calculating a dimension point set of a two-dimensional plane of the experimental data set. Firstly, calculating Pearson correlation coefficients of all dimensions of the original data set; then processing the obtained Pearson correlation coefficients through a CMDS algorithm so that the dimension objects of the original data set are projected to a two-dimensional plane. According to the characteristics of the Pearson correlation coefficient, the stronger the correlation between dimensions, the closer the corresponding dimension points on the two-dimensional plane.

② Selecting a target optimization dimension subspace. Firstly, performing K-means clustering on the dimension point set of the original data set on the two-dimensional plane obtained in the step 1.1) to obtain K clusters of dimension points; then taking the K clusters of dimension points as K candidate dimension subspaces, and processing the data objects of each candidate dimension subspace one by one through the CMDS algorithm; finally, calculating the Dunn indexes of the data points of the K candidate dimension subspaces on the two-dimensional plane, and selecting the candidate dimension subspace with the minimum Dunn index, provided that Dunn < N (the threshold N is determined according to the requirements of the user), as the target optimization subspace; otherwise, no target optimization subspace exists and the whole projection effect optimization process ends directly.

③ Selecting a plurality of dimension subspaces with good clustering structures. According to the Dunn indexes of the K candidate dimension subspaces calculated in the step 1.2), firstly setting a threshold W, which is used to help the user screen out dimension subspaces that meet the user's requirements and have good clustering structure information, wherein the larger the value of W, the stronger the clustering structure information of the screened dimension subspaces; all candidate dimension subspaces with Dunn > W are then selected as the dimension subspaces with good clustering structure.
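The two threshold tests described above can be sketched as follows, assuming the Dunn indexes of the K candidate dimension subspaces have already been computed. The function name, the subspace labels and the numeric values are hypothetical and chosen only for illustration.

```python
def select_subspaces(dunn_by_subspace, n_threshold, w_threshold):
    """Pick the target optimization subspace and the well clustered subspaces.

    dunn_by_subspace -- mapping from candidate subspace name to its Dunn index
    n_threshold      -- threshold N: the target must satisfy Dunn < N
    w_threshold      -- threshold W: good subspaces must satisfy Dunn > W
    """
    worst = min(dunn_by_subspace, key=dunn_by_subspace.get)
    # The candidate with the minimum Dunn index is the target only if Dunn < N.
    target = worst if dunn_by_subspace[worst] < n_threshold else None
    good = [name for name, dunn in dunn_by_subspace.items() if dunn > w_threshold]
    return target, good


example = {"subspace 1": 0.9, "subspace 2": 0.8, "subspace 3": 0.1, "subspace 4": 0.3}
target, good = select_subspaces(example, n_threshold=0.2, w_threshold=0.5)
print(target, good)
```

When `target` is `None`, no candidate falls below N and the optimization process would end at this point, as the text describes.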

Step three, determining a target optimization dimension subspace set. As shown in FIG. 5-(a) and FIG. 5-(b), the dimension subspace 3 and the dimension subspace 4 are determined to be target optimization dimension subspaces and constitute a set. If no such subspace exists, the process ends.

Step four, determining a dimension subspace set with a good clustering structure. As shown in FIG. 5-(a) and FIG. 5-(b), the dimension subspace 1 and the dimension subspace 2 are determined to be dimension subspaces having a good clustering structure and constitute a set. If no such subspace exists, the process ends.

Step five, constructing a reconstruction dimension set. For the dimension subspaces with a good clustering structure found in step four, the clustering information in these subspaces can be effectively transferred to reconstruction dimensions by means of the dimension reconstruction idea and the following algorithm steps:

① projecting the data objects of a dimension subspace with a good clustering structure to a two-dimensional plane through the CMDS algorithm;

② processing all data point sets on the two-dimensional plane through the LDA algorithm and constructing the corresponding discrimination straight lines;

③ projecting all data points on the two-dimensional plane onto the corresponding discrimination lines, so that the set formed by the projection values of all data points on each discrimination line forms a reconstruction dimension.

As a result, as shown in FIG. 5- (c), RD-A and RD-B represent the reconstruction dimensions generated in dimension subspace 1 and dimension subspace 2, respectively, and together constitute a set of reconstruction dimensions.

Step six, constructing a candidate optimization dimension subspace set. According to the target optimization dimension subspaces determined in step three and the reconstruction dimension set determined in step five, carrying out a Cartesian product operation on the target optimization subspaces and the reconstruction dimension set to form the candidate optimization dimension subspace set.

Step seven, judging whether the candidate optimization dimension subspace set is empty, and performing iterative verification.

① If the set is not empty:

⒈ selecting a candidate optimized dimension subspace from the set and removing that element from the set.

⒉ performing projection verification on the candidate optimized subspace through the CMDS algorithm, and comparing and observing whether the projection effect is improved; if so, jumping to 3 for further verification; otherwise, jumping to step six to continue execution.

⒊ calculating the Dunn value of the planar data points obtained after the projection verification in 2.

⒋ determining whether the maximum Dunn value has been set. If it has been set, jumping to 5 to continue execution; otherwise, jumping to 6 to continue execution.

⒌ comparing the Dunn value with the maximum Dunn value; if the Dunn value is larger than the maximum Dunn value, jumping to 6 to continue execution; otherwise, jumping to step six to continue execution.

⒍ setting the currently calculated Dunn value as the maximum Dunn value and setting the current candidate optimized dimension subspace as the optimal subspace.

② If the set is empty, the whole projection optimization process is finished.

Claims

1. A high-dimensional data subspace clustering projection effect optimization method based on dimension reconstruction is characterized by comprising the following steps:

step 2): constructing a reconstruction dimension set: projecting the data objects of the plurality of good clustering dimension subspaces obtained in the step 1) to a two-dimensional plane through a CMDS algorithm; then processing all data point sets on the two-dimensional plane through an LDA algorithm and constructing corresponding discrimination straight lines; finally, projecting all data points on a plurality of two-dimensional planes onto corresponding discrimination lines, wherein a set formed by projection values of all data points on each discrimination line forms a reconstruction dimension, and all reconstruction dimensions form a reconstruction dimension set;

2. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 1) comprises the following steps:

3. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 2) comprises the following steps:

4. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 3) comprises the following steps:

5. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 4) comprises the following steps:

step 4.1): letting the optimized dimension subspace set K_os = ∅, and setting a threshold Q;

step 4.2): according to the candidate optimized dimension subspace set K_cos calculated in the step 3.3), if K_cos ≠ ∅, selecting a candidate optimized dimension subspace s_co, wherein s_co ∈ K_cos, then removing the candidate optimized dimension subspace s_co from the candidate optimized dimension subspace set K_cos, i.e. K_cos = K_cos - s_co, and jumping to the step 4.3) to continue execution; if K_cos = ∅, the calculation of the optimized dimension subspace set K_os is finished, and jumping to the step 5) to continue execution;

6. The method for optimizing the clustering projection effect of the high-dimensional data subspace based on the dimension reconstruction as claimed in claim 1, wherein the step 5) comprises the following steps:

from the optimized dimension subspace set K_os calculated in the step 4), selecting the optimized dimension subspace with the largest corresponding Dunn index as the optimal dimension subspace s_mo.