CN109271441B - High-dimensional data visual clustering analysis method and system - Google Patents
- Publication number
- CN109271441B (application CN201811517242.2A)
- Authority
- CN
- China
- Prior art keywords
- dimensional data
- dimension
- data
- expansion
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a high-dimensional data visual clustering analysis method and system. The method comprises the following steps: carrying out normalization preprocessing on the high-dimensional data; performing dimension expansion on the high-dimensional data subjected to the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data subjected to the dimension expansion; and mapping each group of the high-dimensional data after the dimensionality expansion to a circle-like space by using a circle-like mapping visualization method to realize the visualization clustering of the high-dimensional data. The method or the system can effectively realize the visual clustering of high-dimensional data, particularly the high-dimensional data containing the nonlinear structure.
Description
Technical Field
The invention relates to the field of high-dimensional data visual clustering, in particular to a high-dimensional data visual clustering analysis method and system.
Background
The visualization technology is an important data analysis tool, and the internal structure, information and knowledge of data are expressed mainly by computer graphics, image processing, signal processing and other methods, so that the method is beneficial to researches such as pattern recognition, outlier detection and the like. With the rapid development of computers and sensing equipment, multi-dimensional and even high-dimensional data widely exist in the fields of economy, medicine, military, industry and the like, such as high-dimensional functional magnetic resonance imaging data, three-layer defense systems of multi-dimensional structures and the like. The increase of data dimension and scale brings new opportunities for data visualization. However, the traditional rectangular coordinates can express three-dimensional data at most and are not suitable for visualization research of high-dimensional data.
At present, there are two main types of high-dimensional visualization techniques. The first is the dimension reduction method, which maps high-dimensional data to a low-dimensional space and represents the reduced data by scatter points or other symbols; it mainly includes principal component analysis, self-organizing maps, the neuron measurement method, and the like. Although dimension reduction visualization can, to some extent, overcome the curse of dimensionality in visualization, it may lose potentially important information, which limits the accuracy of high-dimensional data analysis. The second type obtains visualization results without dimension reduction, for example scatter plot matrices, parallel coordinates, and heat maps, and can represent the information of the high-dimensional data intact. However, as the dimension and scale of the data increase, large numbers of curves or color blocks become intricately interlaced because of the limited screen area, which greatly restricts the effectiveness of the visualization.
Compared with the above methods, radial layout visualization methods, represented by Radial Visualization (RadViz) and Star Coordinates (SC), have a significant advantage in expressing high-dimensional data. A radial layout visualization method represents the data dimensions by radii of a circle and maps each individual to a point in a low-dimensional space. Such a method can not only express data of any dimension efficiently in a low-dimensional space, but also project data with similar characteristics to nearby positions, thereby forming a good visual clustering effect. However, RadViz is defined as a general nonlinear mapping that does not take into account the shape and distribution of the data, and SC is itself a linear visualization method. Therefore, when the data has a nonlinear manifold structure, traditional radial layout visualization methods are limited in capturing that structure.
Therefore, how to efficiently realize the visualized clustering of the high-dimensional data, especially the high-dimensional data containing the nonlinear structure, is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a high-dimensional data visual clustering analysis method and system, which are used for efficiently realizing visual clustering of high-dimensional data, particularly high-dimensional data containing nonlinear structures.
In order to achieve the purpose, the invention provides the following scheme:
a method of high dimensional data visualization cluster analysis, the method comprising:
carrying out normalization preprocessing on the high-dimensional data;
performing dimension expansion on the high-dimensional data subjected to the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data subjected to the dimension expansion;
and mapping each group of the high-dimensional data after the dimensionality expansion to a circle-like space by using a circle-like mapping visualization method to realize the visualization clustering of the high-dimensional data.
Optionally, the performing normalization preprocessing on the high-dimensional data specifically includes:
according to the formula $\tilde{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, normalizing the high-dimensional data, wherein $F_{km}$ and $\tilde{F}_{km}$ respectively represent the original attribute value and the normalized attribute value of the kth group of high-dimensional data on the mth dimension; $\max(F_m)$ and $\min(F_m)$ respectively represent the maximum attribute value and the minimum attribute value of the high-dimensional data F on the mth dimension; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M represent the scale and the dimension of the high-dimensional data F, respectively.
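As an illustrative sketch only, the min-max normalization described above could be written in Python as follows. The function name `min_max_normalize`, the sample data, and the treatment of a constant dimension (mapped to 0.0) are assumptions made for the example and are not stated in the source.

```python
def min_max_normalize(data):
    """Scale each column of `data` (K rows, M columns) to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in data:
        new_row = []
        for value, lo, hi in zip(row, lows, highs):
            # Assumption: a constant column (hi == lo) is mapped to 0.0
            # to avoid a division by zero.
            new_row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(new_row)
    return normalized


# Hypothetical data: K = 3 groups, M = 2 dimensions.
F = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
F_norm = min_max_normalize(F)
```

Each column is scaled independently, so the minimum of every dimension maps to 0 and the maximum to 1.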
Optionally, the performing, by the multi-target genetic algorithm, dimension expansion on the high-dimensional data after the normalization processing to obtain the high-dimensional data after the dimension expansion specifically includes:
initializing a population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data;
constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data;
screening out an optimal individual through the multi-target evaluation index, wherein the optimal individual represents an optimal expansion state;
and performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion.
Optionally, the constructing a multi-target evaluation index specifically includes:
determining the expansion dimension of the high-dimensional data by counting the number of 1 in each individual binary code in the population;
according to the formula [formula not reproduced in the source text], determining a topology maintenance index for each of the individuals, wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and $t_k$ represents the rank-ordering score of the kth group of data, determined according to the formula [formula not reproduced in the source text]; u and s both represent numbers of nearest neighbor data points, typically u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ respectively represent the yth nearest data point of the kth group of data points in the original space and in the mapping space; $nn_{kl}$ and $nn_{kt}$ respectively represent the lth and tth nearest data points of the kth group of data points in the mapping space;
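according to the formula $DI = \min_{1 \le i \le nc} \left\{ \min_{1 \le j \le nc,\, j \ne i} \left\{ \dfrac{\delta(C_i, C_j)}{\max_{1 \le k \le nc} \Delta(C_k)} \right\} \right\}$, determining the Dunn index for each of said individuals, wherein DI represents the Dunn index; d(x, y) represents the Euclidean distance between mapping points x and y; $C_i$, $C_j$ and $C_k$ represent the ith, jth and kth clusters of mapping points; nc represents the number of clusters of mapping points; $\delta(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ represents the distance between cluster $C_i$ and cluster $C_j$; and $\Delta(C_k) = \max_{x, y \in C_k} d(x, y)$ represents the diameter of cluster $C_k$.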
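A minimal sketch of the Dunn index follows, assuming the standard definition: the smallest distance between points of different clusters divided by the largest cluster diameter. The function name `dunn_index` and the sample clusters are hypothetical, and the patent's exact variant may differ.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index of `clusters`, a list of at least two clusters,
    each given as a list of 2-D mapping points (tuples)."""
    # Smallest Euclidean distance between points of different clusters.
    min_between = min(
        math.dist(x, y)
        for ci, cj in combinations(clusters, 2)
        for x in ci
        for y in cj
    )
    # Largest diameter, i.e. the largest pairwise distance inside one cluster.
    max_diameter = max(
        (math.dist(x, y) for c in clusters for x, y in combinations(c, 2)),
        default=0.0,
    )
    # Assumption: if every cluster is a single point, report infinity.
    return min_between / max_diameter if max_diameter > 0 else math.inf


# Hypothetical mapping points forming two well-separated clusters.
example = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
di = dunn_index(example)
```

A larger value indicates compact clusters that lie far apart, which is why the index can serve as an objective to be maximized.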
Optionally, performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion, specifically including:
dividing the value range [0, 1] of each dimension of the normalized high-dimensional data into r equal intervals, counting the probability that data values fall in each interval, and determining a probability histogram for each dimension;
dividing each probability histogram by using a neighbor propagation (affinity propagation) clustering algorithm, and determining the division result of each dimension;
and performing dimension expansion according to the division results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, wherein the number of dimensions into which each original dimension is expanded equals the number of clusters of that dimension's probability histogram, and each expanded datum has exactly one new dimension whose value equals its value on the corresponding original dimension.
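The probability histogram of one normalized dimension can be sketched as below. The function name `probability_histogram` and the sample values are hypothetical; placing the value 1.0 in the last interval is an assumption, since the source does not say how interval edges are treated.

```python
def probability_histogram(values, r):
    """Fraction of `values` (each in [0, 1]) that falls in each of
    r equal-width intervals of [0, 1]."""
    counts = [0] * r
    for v in values:
        # The last interval is closed on the right so that v == 1.0 is counted.
        index = min(int(v * r), r - 1)
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts]


# Hypothetical normalized values of a single dimension, with r = 4.
hist = probability_histogram([0.0, 0.1, 0.4, 0.45, 1.0], r=4)
```

The resulting list, paired with the interval positions, is the two-dimensional data that the clustering step then divides.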
Optionally, the mapping each group of the dimensionality-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method to implement the visual clustering of the high-dimensional data specifically includes:
constructing a circle-like space $C_O$, wherein the circle-like space is the unit circle in a two-dimensional rectangular coordinate system, centered at the origin;
according to the formula [formula not reproduced in the source text], determining the correlation among the dimensions of each group of the dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the ith row and the jth column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ranking value of the kth group of data on the ith dimension, the ranking values being the integers from 1 to M assigned to the groups of dimension-expanded data by ordering them according to the size of their attribute values on each dimension;
determining a Fiedler vector by solving a eigenvector corresponding to the maximum eigenvalue of the Laplace matrix of the similarity matrix;
sorting the dimensionalities of the high-dimensional data after each group of dimensionality expansion according to the sizes of elements in the Fiedler vector to obtain sorted high-dimensional data;
according to the formula [formula not reproduced in the source text], determining, for each dimension of the sorted high-dimensional data, its coordinate point $V_{\lambda(i)}$ on the arc of $C_O$, wherein the vector $\lambda$ represents the standard order vector of the sizes of the elements of the Fiedler vector, $\lambda(i)$ represents the ith element of $\lambda$, and i = 1, 2, ..., N;
in the circle-like space, for any dimension-expanded high-dimensional datum $\hat{F}_k$, determining, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance to the coordinate origin is $\hat{F}_{k\lambda(i)}$, and recording it as a two-dimensional mapping point, wherein $\hat{F}_{k\lambda(i)}$ is the attribute value of the kth group of data on the $\lambda(i)$th dimension, so that any datum $\hat{F}_k$ corresponds to N two-dimensional mapping points;
forming one-to-one corresponding polygons through the two-dimensional space point sets corresponding to the groups of data, and determining the geometric center of the polygons;
and, by means of a t-distribution neighborhood embedding (t-SNE) algorithm, reducing the spacing between polygon geometric centers of the same cluster and increasing the spacing between polygon geometric centers of different clusters, so as to determine the positions of the mapping points and realize the visual clustering of the high-dimensional data.
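The geometric part of the mapping steps above can be sketched as follows. Two points are assumptions, because the corresponding formulas are not legible in this text: the anchor points are taken to be evenly spaced on the unit circle, and the geometric center of the polygon is taken to be the mean of its vertices. All function names are hypothetical, and the input values are assumed to be already arranged in the sorted dimension order.

```python
import math


def anchor_points(n):
    """n anchor points evenly spaced on the unit circle (assumed layout)."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]


def map_record(values, anchors):
    """One mapping point per dimension, placed on the origin-to-anchor
    segment at a distance from the origin equal to the attribute value."""
    return [(v * ax, v * ay) for v, (ax, ay) in zip(values, anchors)]


def vertex_centroid(points):
    """Mean of the polygon vertices, used here as the geometric center."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# Hypothetical datum with N = 4 expanded dimensions.
anchors = anchor_points(4)
polygon = map_record([1.0, 0.0, 0.5, 0.0], anchors)
center = vertex_centroid(polygon)
```

In this sketch each datum yields one polygon and one center; the final adjustment of the centers by the embedding algorithm is not shown.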
A high dimensional data visualization cluster analysis system, the system comprising:
the preprocessing module is used for carrying out normalization preprocessing on the high-dimensional data;
the dimensionality extension module is used for carrying out dimensionality extension on the high-dimensional data after the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data after the dimensionality extension;
and the mapping module is used for mapping each group of the dimensionality-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method so as to realize the visualization clustering of the high-dimensional data.
Optionally, the dimension extension module specifically includes:
the initialization unit is used for initializing the population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data;
the index construction unit is used for constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data;
the screening unit is used for screening out the optimal individual through the multi-target evaluation index, and the optimal individual represents the optimal expansion state;
and the dimension expansion unit is used for performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion.
Optionally, the dimension extension unit specifically includes:
the statistical subunit is used for dividing the value range [0, 1] of each dimension of the normalized high-dimensional data into r equal intervals, counting the probability that data values fall in each interval, and determining a probability histogram for each dimension;
the dividing subunit is used for dividing each probability histogram by using a neighbor propagation (affinity propagation) clustering algorithm and determining the division result of each dimension;
and the expansion subunit is used for performing dimension expansion according to the division results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, wherein the number of dimensions into which each original dimension is expanded equals the number of clusters of that dimension's probability histogram, and each expanded datum has exactly one new dimension whose value equals its value on the corresponding original dimension.
Optionally, the mapping module specifically includes:
a circle-like space construction unit for constructing a circle-like space $C_O$, wherein the circle-like space is the unit circle in a two-dimensional rectangular coordinate system, centered at the origin;
a similarity matrix determination unit for determining, according to the formula [formula not reproduced in the source text], the correlation among the dimensions of each group of the dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the ith row and the jth column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ranking value of the kth group of data on the ith dimension, the ranking values being the integers from 1 to M assigned to the groups of dimension-expanded data by ordering them according to the size of their attribute values on each dimension;
the Fiedler vector determining unit is used for determining a Fiedler vector by solving the eigenvector corresponding to the maximum eigenvalue of the Laplace matrix of the similar matrix;
the sorting unit is used for sorting the dimensionality of the high-dimensional data after each group of dimensionality expansion according to the size of the elements in the Fiedler vector to obtain the sorted high-dimensional data;
a coordinate point determination unit for determining, according to the formula [formula not reproduced in the source text], for each dimension of the sorted high-dimensional data, its coordinate point $V_{\lambda(i)}$ on the arc of $C_O$, wherein the vector $\lambda$ represents the standard order vector of the sizes of the elements of the Fiedler vector, $\lambda(i)$ represents the ith element of $\lambda$, and i = 1, 2, ..., N;
a two-dimensional mapping point determining unit for determining, in the circle-like space and for any dimension-expanded high-dimensional datum $\hat{F}_k$, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance to the coordinate origin is $\hat{F}_{k\lambda(i)}$, and recording it as a two-dimensional mapping point, wherein $\hat{F}_{k\lambda(i)}$ is the attribute value of the kth group of data on the $\lambda(i)$th dimension, so that any datum $\hat{F}_k$ corresponds to N two-dimensional mapping points;
the geometric center determining unit is used for forming one-to-one corresponding polygons through the two-dimensional space point sets corresponding to the groups of data and determining the geometric centers of the polygons;
and the visual clustering realization unit is used for reducing, by means of a t-distribution neighborhood embedding (t-SNE) algorithm, the spacing between polygon geometric centers of the same cluster and increasing the spacing between polygon geometric centers of different clusters, so as to determine the positions of the mapping points and realize the visual clustering of the high-dimensional data.
Compared with the prior art, the invention has the following technical effects: the method carries out normalization preprocessing on the high-dimensional data, performs dimension expansion on the normalized high-dimensional data through a multi-target genetic algorithm to obtain dimension-expanded high-dimensional data, and maps each group of the dimension-expanded data to a circle-like space by a circle-like mapping visualization method. The method and system provided by the invention can thereby ensure the scientific soundness and effectiveness of the visual cluster analysis, and more efficiently realize the visual clustering of high-dimensional data, particularly high-dimensional data containing nonlinear structures.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a high dimensional data visualization clustering analysis method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a high dimensional data visualization cluster analysis system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the probability histogram and the division result of each dimension of the iris data set when r = 20, according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a high-dimensional data visual clustering analysis method and system, which are used for efficiently realizing visual clustering of high-dimensional data, particularly high-dimensional data containing nonlinear structures.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the high-dimensional data visualization cluster analysis method includes the following steps:
Step 101: carrying out normalization preprocessing on the high-dimensional data.
According to the formula $\tilde{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, the high-dimensional data is normalized, wherein $F_{km}$ and $\tilde{F}_{km}$ respectively represent the original attribute value and the normalized attribute value of the kth group of high-dimensional data on the mth dimension; $\max(F_m)$ and $\min(F_m)$ respectively represent the maximum attribute value and the minimum attribute value of the high-dimensional data F on the mth dimension; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M represent the scale and the dimension of the high-dimensional data F, respectively.
Step 102: performing dimension expansion on the normalized high-dimensional data through a multi-target genetic algorithm to obtain the dimension-expanded high-dimensional data. The method specifically comprises the following steps:
1) initializing a population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data.
2) Constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data. Specifically, the method comprises the following steps:
determining the expansion dimension of the high-dimensional data by counting the number of 1s in the binary code of each individual in the population;
according to the formula [formula not reproduced in the source text], determining a topology maintenance index for each of the individuals, wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and $t_k$ represents the rank-ordering score of the kth group of data, determined according to the formula [formula not reproduced in the source text]; u and s both represent numbers of nearest neighbor data points, typically u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ respectively represent the yth nearest data point of the kth group of data points in the original space and in the mapping space; $nn_{kl}$ and $nn_{kt}$ respectively represent the lth and tth nearest data points of the kth group of data points in the mapping space;
according to the formula $DI = \min_{1 \le i \le nc} \left\{ \min_{1 \le j \le nc,\, j \ne i} \left\{ \dfrac{\delta(C_i, C_j)}{\max_{1 \le k \le nc} \Delta(C_k)} \right\} \right\}$, determining the Dunn index for each of said individuals, wherein DI represents the Dunn index; d(x, y) represents the Euclidean distance between mapping points x and y; $C_i$, $C_j$ and $C_k$ represent the ith, jth and kth clusters of mapping points; nc represents the number of clusters of mapping points; $\delta(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ represents the distance between cluster $C_i$ and cluster $C_j$; and $\Delta(C_k) = \max_{x, y \in C_k} d(x, y)$ represents the diameter of cluster $C_k$.
3) And screening out the optimal individual through the multi-target evaluation index, wherein the optimal individual represents the optimal expansion state.
4) Performing dimension expansion on the normalized high-dimensional data according to the optimal expansion state to obtain the dimension-expanded high-dimensional data. Specifically: dividing the value range [0, 1] of each dimension of the normalized data into r equal intervals, counting the probability that data values fall in each interval, and determining a probability histogram for each dimension; dividing each probability histogram by using a neighbor propagation (affinity propagation) clustering algorithm, and determining the division result of each dimension; and performing dimension expansion according to the division results and the optimal expansion state, wherein the number of dimensions into which each original dimension is expanded equals the number of clusters of that dimension's probability histogram, and each expanded datum has exactly one new dimension whose value equals its value on the corresponding original dimension, namely the new dimension corresponding to the division part in which that value falls, the values on the remaining new dimensions being 0.
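The binary encoding of expansion states used in Step 102 can be sketched as follows. This shows only population initialization and the first objective (the expansion dimension); the selection, crossover, and mutation of the genetic algorithm are not shown. The function names, the population size, and the fixed random seed are assumptions made for the example.

```python
import random


def init_population(size, n_dims, seed=0):
    """Random binary individuals; bit m set to 1 means that dimension m
    is to be expanded, 0 means that it is left unchanged."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_dims)] for _ in range(size)]


def expansion_dimension(individual):
    """First objective: the number of 1s in the individual's binary code."""
    return sum(individual)


# Hypothetical population of 6 individuals for 4-dimensional data.
population = init_population(size=6, n_dims=4)
objectives = [expansion_dimension(ind) for ind in population]
```

The other two objectives (topology maintenance index and Dunn index) would be evaluated on the mapping produced for each individual.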
Step 103: mapping each group of the dimension-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method to realize the visual clustering of the high-dimensional data. The method specifically comprises the following steps:
constructing a circle-like space $C_O$, wherein the circle-like space is the unit circle in a two-dimensional rectangular coordinate system, centered at the origin;
according to the formula [formula not reproduced in the source text], determining the correlation among the dimensions of each group of the dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the ith row and the jth column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ranking value of the kth group of data on the ith dimension, the ranking values being the integers from 1 to M assigned to the groups of dimension-expanded data by ordering them according to the size of their attribute values on each dimension;
determining a Fiedler vector by solving a eigenvector corresponding to the maximum eigenvalue of the Laplace matrix of the similarity matrix;
sorting the dimensionalities of the high-dimensional data after each group of dimensionality expansion according to the sizes of elements in the Fiedler vector to obtain sorted high-dimensional data;
according to the formula [formula not reproduced in the source text], determining, for each dimension of the sorted high-dimensional data, its coordinate point $V_{\lambda(i)}$ on the arc of $C_O$, wherein the vector $\lambda$ represents the standard order vector of the sizes of the elements of the Fiedler vector, $\lambda(i)$ represents the ith element of $\lambda$, and i = 1, 2, ..., N;
in the circle-like space, for any dimension-expanded high-dimensional datum $\hat{F}_k$, determining, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance to the coordinate origin is $\hat{F}_{k\lambda(i)}$, and recording it as a two-dimensional mapping point, wherein $\hat{F}_{k\lambda(i)}$ is the attribute value of the kth group of data on the $\lambda(i)$th dimension, so that any datum $\hat{F}_k$ corresponds to N two-dimensional mapping points;
forming one-to-one corresponding polygons through the two-dimensional space point sets corresponding to the groups of data, and determining the geometric center of the polygons;
and, by means of a t-distribution neighborhood embedding (t-SNE) algorithm, reducing the spacing between polygon geometric centers of the same cluster and increasing the spacing between polygon geometric centers of different clusters, so as to determine the positions of the mapping points and realize the visual clustering of the high-dimensional data.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects: the method carries out normalization preprocessing on the high-dimensional data, performs dimension expansion on the normalized high-dimensional data through a multi-target genetic algorithm to obtain dimension-expanded high-dimensional data, and maps each group of the dimension-expanded data to a circle-like space by a circle-like mapping visualization method. The method and system provided by the invention can thereby ensure the scientific soundness and effectiveness of the visual cluster analysis, and more efficiently realize the visual clustering of high-dimensional data, particularly high-dimensional data containing nonlinear structures.
The visual cluster analysis method proposed in this patent is described below by taking the iris data set, which contains 150 groups of 4-dimensional data, as an example.
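The similarity-matrix step of Step 103 is defined in terms of per-dimension ranking values, but its formula is not legible in this text. The sketch below assumes Spearman's rank correlation, $S_{ij} = 1 - \dfrac{6 \sum_{k} (t_{ki} - t_{kj})^2}{K(K^2 - 1)}$, which uses exactly the quantities the text names; this choice is an assumption, and the function names and sample data are hypothetical.

```python
def ranks(column):
    """Ranking values 1..K by ascending size (ties broken by position,
    a simplification; Spearman's formula assumes distinct values)."""
    order = sorted(range(len(column)), key=lambda k: column[k])
    result = [0] * len(column)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result


def spearman_similarity(data):
    """Similarity matrix between the columns (dimensions) of `data`,
    which must contain at least two rows (groups of data)."""
    k = len(data)
    t = [ranks(col) for col in zip(*data)]
    n = len(t)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(t[i], t[j]))
            s[i][j] = 1 - 6 * d2 / (k * (k * k - 1))
    return s


# Hypothetical data: 3 groups, 3 dimensions.
S = spearman_similarity([[0.1, 0.2, 0.9], [0.5, 0.6, 0.4], [0.9, 0.8, 0.1]])
```

Note that the Fiedler vector is conventionally the eigenvector of the second-smallest eigenvalue of the graph Laplacian, whereas the text above specifies the maximum eigenvalue; the eigenvector computation itself is not shown here.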
Step A: carrying out normalization preprocessing on the iris data set, specifically comprising:
according to the formula $\tilde{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, normalizing the iris data set F, wherein $F_{km}$ and $\tilde{F}_{km}$ respectively represent the original attribute value and the normalized attribute value of the kth group of the iris data set on the mth dimension; $\max(F_m)$ and $\min(F_m)$ respectively represent the maximum and minimum attribute values of the iris data set on the mth dimension; k = 1, 2, ..., 150 and m = 1, 2, 3, 4;
Step B: performing dimension expansion on the normalized iris data set through the NSGA-II multi-target genetic algorithm to obtain the dimension-expanded iris data set, specifically comprising:
initializing a population of the NSGA-II multi-target genetic algorithm; the population comprises a plurality of individuals; each individual is a binary code representing an expansion state of the high-dimensional data, the length of the binary code being the dimension of the iris data set, 4, wherein a 1 or a 0 in the binary code indicates that the corresponding dimension of the iris data set is or is not expanded, respectively;
constructing a multi-target evaluation index, wherein the multi-target evaluation index comprises the expansion dimension, the topology maintenance index and the Dunn index of the iris data set;
screening out an optimal individual through the multi-target evaluation index, wherein the optimal individual represents an optimal expansion state of the iris data set;
performing dimension expansion on the iris data set subjected to the normalization processing according to the optimal expansion state to obtain the iris data set subjected to the dimension expansion, and specifically comprising the following steps:
dividing the value range [0, 1] of each dimension of the normalized iris data set into 20 equal intervals, counting the probability that data values fall in each interval, and determining the probability histograms of the 4 dimensions;
dividing each of the 4 probability histograms by using a neighbor propagation (affinity propagation) clustering algorithm to determine the division results of the 4 dimensions, wherein the division of a probability distribution can be regarded as clustering two-dimensional data whose coordinates are the x-axis value (the data value) and the y-axis value (the probability) of each dimension's probability histogram. Fig. 3 shows the probability histograms of the 4 dimensions of the iris data set and their division results, in which the two-dimensional data coordinates are represented by scatter points, and scatter points of the same division are connected by a broken line of the same type.
Performing dimension expansion according to the division results and the optimal expansion state to obtain the dimension-expanded high-dimensional iris data set, wherein the number of dimensions into which each of the 4 dimensions is expanded equals the number of clusters of its probability histogram, and each expanded datum has exactly one new dimension whose value equals its value on the corresponding original dimension, namely the new dimension corresponding to the division part in which that value falls, the values on the other new dimensions being 0. For example, Fig. 3 shows that the first dimension of the iris data set is divided into 3 parts, containing 6, 7, and 7 histogram points, respectively; that is, the first dimension is expanded into three new dimensions, with the divisions at the values 0.3 and 0.65. Thus, if 3 groups of data have the values 0.2, 0.5, and 0.8 on the first dimension of the iris data set, their values on the new dimensions are [0.2, 0, 0], [0, 0.5, 0], and [0, 0, 0.8], respectively.
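The worked example above can be reproduced with a short sketch. The function name `expand_value` is hypothetical, the division boundaries 0.3 and 0.65 are read from the example as data values on the first dimension, and the handling of a value that lies exactly on a boundary (assigned to the upper part) is an assumption.

```python
from bisect import bisect_right


def expand_value(value, boundaries):
    """Expand one scalar into len(boundaries) + 1 new dimensions: the value
    is kept in the part it falls into and every other new dimension is 0."""
    expanded = [0.0] * (len(boundaries) + 1)
    # Assumption: a value equal to a boundary goes to the upper part.
    expanded[bisect_right(boundaries, value)] = value
    return expanded


# Division of the first iris dimension into 3 parts, as in the example.
boundaries = [0.3, 0.65]
rows = [expand_value(v, boundaries) for v in (0.2, 0.5, 0.8)]
# rows == [[0.2, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.8]]
```

Applying the same function to every dimension marked with a 1 in the optimal individual, and concatenating the results, would give the full dimension-expanded datum.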
Step C: mapping each group of data of the dimension-expanded high-dimensional iris data set to a circle-like space by using the circle-like mapping visualization method, which specifically comprises:
constructing a circle-like space $C_0$, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system centered at the origin;
determining, according to a similarity formula [formula not reproduced], the correlation between the dimensions of the dimension-expanded iris data set to obtain a similarity matrix, wherein $S_{ij}$ is the element in the i-th row and j-th column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ranking value of the k-th group of data in the i-th dimension, the ranking values being obtained by ranking the groups of data of the dimension-expanded iris data by attribute value within each dimension using the integers 1 to N, where N is the dimension of the dimension-expanded iris data set;
determining a Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;
sorting the dimensions of the dimension-expanded high-dimensional iris data set according to the sizes of the elements in the Fiedler vector to obtain the sorted high-dimensional data;
determining, according to a formula [formula not reproduced], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional iris data set on the arc of $C_0$, wherein the vector λ represents the standard order vector of the element sizes of the Fiedler vector, λ(i) represents the i-th element of λ, and i = 1, 2, ...;
in the circle-like space, for any group of data of the dimension-expanded iris data set, determining the point on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$ whose distance to the coordinate origin is equal to the attribute value of the k-th group of data in the λ(i)-th dimension, and recording it as a two-dimensional mapping point, so that any individual corresponds to N two-dimensional mapping points;
forming, from the two-dimensional point set corresponding to each group of data of the iris data set, a polygon in one-to-one correspondence with that group, and determining the geometric center of each polygon;
and reducing the spacing between polygon geometric centers of the same cluster and increasing the spacing between polygon geometric centers of different clusters through the t-SNE algorithm to determine the positions of the mapping points, thereby realizing the visual clustering of the iris data set.
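The geometric part of Step C (placing one coordinate point per dimension on the unit circle, placing each attribute value along the line from the origin to that point, and taking the center of the resulting polygon) can be sketched as follows. The formula for the coordinate points is not reproduced in this text, so the even angular spacing used in `anchor_points` is an assumption, as is the use of the mean of the polygon vertices as the geometric center; both function names are hypothetical.

```python
import math

def anchor_points(order):
    """Place the dimension with 1-based position pos at angle 2*pi*(pos-1)/N.

    Assumption: positions are spread evenly over the unit circle.
    """
    n = len(order)
    return [(math.cos(2 * math.pi * (pos - 1) / n),
             math.sin(2 * math.pi * (pos - 1) / n)) for pos in order]

def map_record(record, anchors):
    """Return the N mapping points of one record and the mean of those points."""
    # Each mapping point lies on the line from the origin to an anchor,
    # at a distance from the origin equal to the attribute value.
    points = [(value * ax, value * ay) for value, (ax, ay) in zip(record, anchors)]
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return points, (cx, cy)

anchors = anchor_points([1, 2, 3, 4])
points, center = map_record([0.5, 0.0, 0.0, 0.0], anchors)
```

The final adjustment of the centers by t-SNE is not shown; the sketch stops at the polygon centers that such a step would take as input.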
As shown in fig. 2, the present invention further provides a high dimensional data visualization cluster analysis system, which includes:
the preprocessing module 201 is configured to perform normalization preprocessing on the high-dimensional data according to the formula $F'_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, wherein $F_{km}$ and $F'_{km}$ respectively represent the original attribute value and the normalized attribute value of the k-th group of high-dimensional data on the m-th dimension; $\max(F_m)$ and $\min(F_m)$ respectively represent the maximum attribute value and the minimum attribute value of the high-dimensional data F on the m-th dimension; k = 1, 2, ..., K; m = 1, 2, ..., M; and K and M represent the scale and the dimension of the high-dimensional data F, respectively.
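As an illustrative sketch only, the normalization performed by the preprocessing module can be written in a few lines of Python, assuming it is the usual min-max scaling of each dimension by its minimum and maximum attribute values. The function name and the handling of a constant dimension are assumptions and do not come from the patent.

```python
def min_max_normalize(data):
    """Scale every column of a K x M list of lists to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # Assumption: a constant column (max == min) is mapped to 0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in data
    ]

normalized = min_max_normalize([[1, 10], [2, 20], [3, 30]])
# normalized == [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```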
And the dimension expansion module 202 is configured to perform dimension expansion on the normalized high-dimensional data through a multi-target genetic algorithm to obtain the high-dimensional data after the dimension expansion.
The dimension extension module 202 specifically includes:
the initialization unit is used for initializing the population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data;
the index construction unit is used for constructing a multi-target evaluation index; the multi-target evaluation index comprises the expansion dimension, the topology maintenance index, and the Dunn index of the high-dimensional data; specifically, the number of 1s in the binary code of each individual in the population is counted to determine the expansion dimension of the high-dimensional data; a topology maintenance index is determined for each of the individuals according to a formula [formula not reproduced], wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and $t_k$ represents the rank ordering of the k-th group of data and is determined according to a second formula [formula not reproduced]; u and s both represent numbers of nearest-neighbor data points, typically u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ represent the y-th nearest data point of the k-th group of data points in the original space and in the mapping space, respectively; $nn_{kl}$ and $nn_{kt}$ respectively represent the l-th and t-th nearest data points of the k-th group of data points in the mapping space;
a Dunn index is determined for each of the individuals according to a formula [formula not reproduced], wherein DI represents the Dunn index, d(x, y) represents the Euclidean distance between mapping points x and y, $C_i$, $C_j$, and $C_k$ represent the i-th, j-th, and k-th clusters of mapping points, nc represents the number of clusters of mapping points, and the formula uses the distance between cluster $C_i$ and cluster $C_j$ and the diameter of cluster $C_k$.
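Two of the evaluation indexes above can be sketched in Python. The Dunn index formula is not reproduced in this text, so the sketch uses the standard textbook definition (smallest distance between points of different clusters divided by the largest cluster diameter), which is assumed rather than confirmed to match the patent's formula. The function names are hypothetical.

```python
import math
from itertools import combinations

def expansion_dimension(bits):
    """Count the 1s in an individual's binary code (given as a list of 0/1)."""
    return sum(bits)

def dunn_index(clusters):
    """clusters: at least two clusters, each a list of (x, y) mapping points."""
    # Smallest Euclidean distance between points of two different clusters.
    separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a for q in b
    )
    # Largest Euclidean distance between two points of the same cluster.
    diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    return separation / diameter if diameter > 0 else math.inf

di = dunn_index([[(0, 0), (0, 1)], [(3, 0), (3, 1)]])
# di == 3.0: separation 3 between the clusters, diameter 1 within each
```

A larger value indicates compact, well-separated clusters of mapping points, which is why the index is suitable as one objective when individuals are compared.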
The screening unit is used for screening out the optimal individual through the multi-target evaluation index, and the optimal individual represents the optimal expansion state;
and the dimension expansion unit is used for performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion.
The dimension extension unit specifically includes:
the statistical subunit is used for dividing the value range [0, 1] into r equal intervals, counting the probability with which the values of each dimension of the normalized high-dimensional data fall into each interval, and determining a probability histogram of each dimension;
the dividing unit is used for dividing each probability histogram by using the affinity propagation clustering algorithm and determining the division result of each dimension;
and the expansion subunit is used for performing dimension expansion according to the division results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, wherein the number of new dimensions obtained from each original dimension is equal to the number of clusters of the probability histogram of that dimension, and, for each original dimension, exactly one of its new dimensions holds the data value of that original dimension.
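The work of the statistical subunit can be illustrated by a short Python sketch that builds the probability histogram of one normalized dimension. The function name, the default r = 20 (taken from the iris example), and the placement of the value 1.0 in the last interval are assumptions; the subsequent clustering of the histogram is not shown.

```python
def probability_histogram(values, r=20):
    """Fraction of values in each of r equal-width intervals covering [0, 1].

    Assumes a non-empty list of values already normalized to [0, 1].
    """
    counts = [0] * r
    for v in values:
        # min() keeps the value 1.0 in the last interval instead of overflowing.
        counts[min(int(v * r), r - 1)] += 1
    total = len(values)
    return [c / total for c in counts]

hist = probability_histogram([0.0, 0.1, 0.5, 1.0], r=4)
# hist == [0.5, 0.0, 0.25, 0.25]
```

Each pair of interval position and probability then forms one two-dimensional point of the kind that the dividing unit clusters.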
And the mapping module 203 is configured to map each set of the dimensionality-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method, so as to implement visual clustering of the high-dimensional data.
The mapping module 203 specifically includes:
a similarity matrix determination unit for determining, according to a similarity formula [formula not reproduced], the correlation between the dimensions of the dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the i-th row and j-th column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ranking value of the k-th group of data in the i-th dimension, the ranking values being obtained by ranking the groups of data of the dimension-expanded high-dimensional data by attribute value within each dimension using the integers 1 to M;
the Fiedler vector determining unit is used for determining a Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;
the sorting unit is used for sorting the dimensions of the dimension-expanded high-dimensional data according to the sizes of the elements in the Fiedler vector to obtain the sorted high-dimensional data;
a coordinate point determination unit for determining, according to a formula [formula not reproduced], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_0$, wherein the vector λ represents the standard order vector of the element sizes of the Fiedler vector, λ(i) represents the i-th element of λ, and i = 1, 2, ...;
a two-dimensional mapping point determining unit for determining, in the circle-like space and for any group of high-dimensional data, the point on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$ whose distance to the coordinate origin is equal to the attribute value of the k-th group of data in the λ(i)-th dimension, and recording it as a two-dimensional mapping point, so that any individual corresponds to N two-dimensional mapping points;
the geometric center determining unit is used for forming, from the two-dimensional point set corresponding to each group of data, a polygon in one-to-one correspondence with that group, and determining the geometric center of each polygon;
and the visual clustering realization unit is used for reducing the spacing between polygon geometric centers of the same cluster and increasing the spacing between polygon geometric centers of different clusters through the t-distributed stochastic neighbor embedding (t-SNE) algorithm to determine the positions of the mapping points, thereby realizing the visual clustering of the high-dimensional data.
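The sorting unit described above orders the dimensions by the sizes of the Fiedler vector elements. A minimal Python sketch of that ordering follows; it assumes that the standard order vector λ holds the 1-based ascending rank of each element, which is one reading of the text and not confirmed by it, and it takes the vector as given rather than computing it. Both function names are hypothetical.

```python
def standard_order(vector):
    """Return ranks where ranks[i] is the 1-based ascending rank of vector[i]."""
    by_size = sorted(range(len(vector)), key=lambda i: vector[i])
    ranks = [0] * len(vector)
    for rank, i in enumerate(by_size, start=1):
        ranks[i] = rank
    return ranks

def reorder_dimensions(data, ranks):
    """Move column i of every record to position ranks[i] (1-based)."""
    ordered = []
    for row in data:
        new_row = [None] * len(row)
        for i, value in enumerate(row):
            new_row[ranks[i] - 1] = value
        ordered.append(new_row)
    return ordered

ranks = standard_order([0.3, -0.5, 0.1])
# ranks == [3, 1, 2]
reordered = reorder_dimensions([["a", "b", "c"]], ranks)
# reordered == [["b", "c", "a"]]
```

Ordering the dimensions in this way places correlated dimensions at neighboring positions on the circle before the mapping points are computed.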
The embodiments in this description are described progressively; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help in understanding the method and the core concept of the present invention; meanwhile, a person skilled in the art may, according to the idea of the present invention, change the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.
Claims (7)
1. A high-dimensional data visualization cluster analysis method is characterized by comprising the following steps:
carrying out normalization preprocessing on the high-dimensional data;
performing dimension expansion on the high-dimensional data subjected to the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data subjected to the dimension expansion; the method specifically comprises the following steps:
initializing a population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data;
constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data; the method specifically comprises the following steps:
determining the expansion dimension of the high-dimensional data by counting the number of 1 in each individual binary code in the population;
determining a topology maintenance index for each of the individuals according to a formula [formula not reproduced], wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and $t_k$ represents the rank ordering of the k-th group of data and is determined according to a second formula [formula not reproduced]; u and s both represent numbers of nearest-neighbor data points; $NN_{ky}$ and $nn_{ky}$ represent the y-th nearest data point of the k-th group of data points in the original space and in the mapping space, respectively; $nn_{kl}$ and $nn_{kt}$ respectively represent the l-th and t-th nearest data points of the k-th group of data points in the mapping space; determining a Dunn index for each of the individuals according to a formula [formula not reproduced], wherein DI represents the Dunn index, d(x, y) represents the Euclidean distance between mapping points x and y, $C_i$, $C_j$, and $C_k$ represent the i-th, j-th, and k-th clusters of mapping points, nc represents the number of clusters of mapping points, and the formula uses the distance between cluster $C_i$ and cluster $C_j$ and the diameter of cluster $C_k$;
screening out an optimal individual through the multi-target evaluation index, wherein the optimal individual represents an optimal expansion state;
performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion;
and mapping each group of the high-dimensional data after the dimensionality expansion to a circle-like space by using a circle-like mapping visualization method to realize the visualization clustering of the high-dimensional data.
2. The high-dimensional data visualization cluster analysis method according to claim 1, wherein the normalization preprocessing of the high-dimensional data specifically comprises: performing normalization preprocessing on the high-dimensional data according to the formula $F'_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, wherein $F_{km}$ and $F'_{km}$ respectively represent the original attribute value and the normalized attribute value of the k-th group of high-dimensional data on the m-th dimension; $\max(F_m)$ and $\min(F_m)$ respectively represent the maximum attribute value and the minimum attribute value of the high-dimensional data F on the m-th dimension; k = 1, 2, ..., K; m = 1, 2, ..., M; and K and M represent the scale and the dimension of the high-dimensional data F, respectively.
3. The high-dimensional data visualization cluster analysis method according to claim 1, wherein performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion specifically comprises:
dividing the value range [0, 1] into r equal intervals, counting the probability with which the values of each dimension of the normalized high-dimensional data fall into each interval, and determining a probability histogram of each dimension;
dividing each probability histogram by using the affinity propagation clustering algorithm, and determining the division result of each dimension;
and performing dimension expansion according to the division results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, wherein the number of new dimensions obtained from each original dimension is equal to the number of clusters of the probability histogram of that dimension, and, for each original dimension, exactly one of its new dimensions holds the data value of that original dimension.
4. The method for high-dimensional data visual cluster analysis according to claim 1, wherein the step of mapping each set of the dimensionality-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method to realize visual clustering of the high-dimensional data specifically comprises the steps of:
constructing a circle-like space $C_0$, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system centered at the origin;
determining, according to a similarity formula [formula not reproduced], the correlation between the dimensions of the dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the i-th row and j-th column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ranking value of the k-th group of data in the i-th dimension, the ranking values being obtained by ranking the groups of data of the dimension-expanded high-dimensional data by attribute value within each dimension using the integers 1 to M;
determining a Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;
sorting the dimensions of the dimension-expanded high-dimensional data according to the sizes of the elements in the Fiedler vector to obtain the sorted high-dimensional data;
determining, according to a formula [formula not reproduced], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_0$, wherein the vector λ represents the standard order vector of the element sizes of the Fiedler vector, λ(i) represents the i-th element of λ, and i = 1, 2, ...;
in the circle-like space, for any group of high-dimensional data, determining the point on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$ whose distance to the coordinate origin is equal to the attribute value of the k-th group of data in the λ(i)-th dimension, and recording it as a two-dimensional mapping point, so that any individual corresponds to N two-dimensional mapping points;
forming, from the two-dimensional point set corresponding to each group of data, a polygon in one-to-one correspondence with that group, and determining the geometric center of each polygon;
and reducing the spacing between polygon geometric centers of the same cluster and increasing the spacing between polygon geometric centers of different clusters through the t-distributed stochastic neighbor embedding (t-SNE) algorithm to determine the positions of the mapping points, thereby realizing the visual clustering of the high-dimensional data.
5. A high dimensional data visualization cluster analysis system, the system comprising:
the preprocessing module is used for carrying out normalization preprocessing on the high-dimensional data;
the dimensionality extension module is used for carrying out dimensionality extension on the high-dimensional data after the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data after the dimensionality extension;
the mapping module is used for mapping each group of the high-dimensional data after the dimensionality expansion to a circle-like space by using a circle-like mapping visualization method to realize the visualization clustering of the high-dimensional data;
the dimension extension module specifically includes:
the initialization unit is used for initializing the population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data;
the index construction unit is used for constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data; the method specifically comprises the following steps:
determining the expansion dimension of the high-dimensional data by counting the number of 1 in each individual binary code in the population;
determining a topology maintenance index for each of the individuals according to a formula [formula not reproduced], wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and $t_k$ represents the rank ordering of the k-th group of data and is determined according to a second formula [formula not reproduced]; u and s both represent numbers of nearest-neighbor data points; $NN_{ky}$ and $nn_{ky}$ represent the y-th nearest data point of the k-th group of data points in the original space and in the mapping space, respectively; $nn_{kl}$ and $nn_{kt}$ respectively represent the l-th and t-th nearest data points of the k-th group of data points in the mapping space; determining a Dunn index for each of the individuals according to a formula [formula not reproduced], wherein DI represents the Dunn index, d(x, y) represents the Euclidean distance between mapping points x and y, $C_i$, $C_j$, and $C_k$ represent the i-th, j-th, and k-th clusters of mapping points, nc represents the number of clusters of mapping points, and the formula uses the distance between cluster $C_i$ and cluster $C_j$ and the diameter of cluster $C_k$;
the screening unit is used for screening out the optimal individual through the multi-target evaluation index, and the optimal individual represents the optimal expansion state;
and the dimension expansion unit is used for performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion.
6. The high-dimensional data visualization cluster analysis system according to claim 5, wherein the dimension extension unit specifically comprises:
the statistical subunit is used for dividing the value range [0, 1] into r equal intervals, counting the probability with which the values of each dimension of the normalized high-dimensional data fall into each interval, and determining a probability histogram of each dimension;
the dividing unit is used for dividing each probability histogram by using the affinity propagation clustering algorithm and determining the division result of each dimension;
and the expansion subunit is used for performing dimension expansion according to the division results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, wherein the number of new dimensions obtained from each original dimension is equal to the number of clusters of the probability histogram of that dimension, and, for each original dimension, exactly one of its new dimensions holds the data value of that original dimension.
7. The high-dimensional data visualization cluster analysis system according to claim 5, wherein the mapping module specifically comprises:
a circle-like space construction unit for constructing a circle-like space $C_0$, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system centered at the origin;
a similarity matrix determination unit for determining, according to a similarity formula [formula not reproduced], the correlation between the dimensions of the dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the i-th row and j-th column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ranking value of the k-th group of data in the i-th dimension, the ranking values being obtained by ranking the groups of data of the dimension-expanded high-dimensional data by attribute value within each dimension using the integers 1 to M;
the Fiedler vector determining unit is used for determining a Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;
the sorting unit is used for sorting the dimensions of the dimension-expanded high-dimensional data according to the sizes of the elements in the Fiedler vector to obtain the sorted high-dimensional data;
a coordinate point determination unit for determining, according to a formula [formula not reproduced], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_0$, wherein the vector λ represents the standard order vector of the element sizes of the Fiedler vector, λ(i) represents the i-th element of λ, and i = 1, 2, ...;
a two-dimensional mapping point determining unit for determining, in the circle-like space and for any group of high-dimensional data, the point on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$ whose distance to the coordinate origin is equal to the attribute value of the k-th group of data in the λ(i)-th dimension, and recording it as a two-dimensional mapping point, so that any individual corresponds to N two-dimensional mapping points;
the geometric center determining unit is used for forming, from the two-dimensional point set corresponding to each group of data, a polygon in one-to-one correspondence with that group, and determining the geometric center of each polygon;
and the visual clustering realization unit is used for reducing the spacing between polygon geometric centers of the same cluster and increasing the spacing between polygon geometric centers of different clusters through the t-distributed stochastic neighbor embedding (t-SNE) algorithm to determine the positions of the mapping points, thereby realizing the visual clustering of the high-dimensional data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811517242.2A CN109271441B (en) | 2018-12-12 | 2018-12-12 | High-dimensional data visual clustering analysis method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109271441A (en) | 2019-01-25 |
CN109271441B (en) | 2020-09-01 |
Family
ID=65187645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811517242.2A Active CN109271441B (en) | 2018-12-12 | 2018-12-12 | High-dimensional data visual clustering analysis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109271441B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162568B (en) * | 2019-05-24 | 2021-01-08 | 东北大学 | Three-dimensional data visualization method based on PCA-Radviz |
CN110308873B (en) * | 2019-06-24 | 2023-04-07 | 浙江大华技术股份有限公司 | Data storage method, device, equipment and medium |
CN110458187B (en) * | 2019-06-27 | 2020-07-31 | 广州大学 | Malicious code family clustering method and system |
CN110781569B (en) * | 2019-11-08 | 2023-12-19 | 桂林电子科技大学 | Abnormality detection method and system based on multi-resolution grid division |
CN113095427B (en) * | 2021-04-23 | 2022-09-13 | 中南大学 | High-dimensional data analysis method and face data analysis method based on user guidance |
US12026450B2 (en) | 2022-08-01 | 2024-07-02 | International Business Machines Corporation | Visual representation for higher dimension data sets |
CN116049697A (en) * | 2023-01-10 | 2023-05-02 | 苏州科技大学 | Interactive clustering quality improving method based on user intention learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108764676B (en) * | 2018-05-17 | 2020-10-30 | 南昌航空大学 | High-dimensional multi-target evaluation method and system |
Also Published As
Publication number | Publication date |
---|---|
CN109271441A (en) | 2019-01-25 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109271441B (en) | High-dimensional data visual clustering analysis method and system | |
Zhang et al. | Local density adaptive similarity measurement for spectral clustering | |
CN112990010B (en) | Point cloud data processing method and device, computer equipment and storage medium | |
CN103164701B (en) | Handwritten Numeral Recognition Method and device | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
CN108564130B (en) | Infrared target identification method based on monogenic features and multi-kernel learning | |
CN103403704A (en) | Method and device for finding nearest neighbor | |
CN108764676B (en) | High-dimensional multi-target evaluation method and system | |
CN104282025A (en) | Biomedical image feature extraction method | |
CN108960335A (en) | One kind carrying out efficient clustering method based on large scale network | |
CN110188864B (en) | Small sample learning method based on distribution representation and distribution measurement | |
CN110083731B (en) | Image retrieval method, device, computer equipment and storage medium | |
CN113496260B (en) | Grain depot personnel non-standard operation detection method based on improved YOLOv3 algorithm | |
CN114332172A (en) | Improved laser point cloud registration method based on covariance matrix | |
CN110390337B (en) | Ship individual identification method | |
WO2023050461A1 (en) | Data clustering method and system, and storage medium | |
US20220156416A1 (en) | Techniques for comparing geometric styles of 3d cad objects | |
CN105718950B (en) | A kind of semi-supervised multi-angle of view clustering method based on structural constraint | |
CN109977787B (en) | Multi-view human behavior identification method | |
CN113627522A (en) | Image classification method, device and equipment based on relational network and storage medium | |
Borges et al. | Spatial-time motifs discovery | |
Teng et al. | The calculation of similarity and its application in data mining | |
Xue | Comparison of conventional and lightweight convolutional neural networks for Image Classification | |
Shi et al. | Metric-based curve clustering and feature extraction in flow visualization | |
Lu et al. | K‐Nearest Neighbor Intervals Based AP Clustering Algorithm for Large Incomplete Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||