CN109271441A - Visual cluster analysis method and system for high-dimensional data - Google Patents
- Publication number
- CN109271441A (application number CN201811517242.2A)
- Authority
- CN
- China
- Prior art keywords
- dimension
- dimensional data
- high dimensional
- data
- extension
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a visual cluster analysis method and system for high-dimensional data. The method comprises: normalizing the high-dimensional data as a preprocessing step; performing dimension expansion on the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-expanded high-dimensional data; and mapping each group of the dimension-expanded high-dimensional data into a circle-like space using a circle-like mapping visualization method, thereby achieving visual clustering of the high-dimensional data. The method or system can efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
Description
Technical field
The present invention relates to the field of visual clustering of high-dimensional data, and in particular to a visual cluster analysis method and system for high-dimensional data.
Background art
Visualization is an important data analysis tool. It mainly uses methods from computer graphics, image processing, and signal processing to express the internal structure, information, and knowledge of data, which benefits research such as pattern recognition and outlier detection. With the rapid development of computers and sensing devices, multidimensional and even high-dimensional data have become widespread in fields such as economics, medicine, the military, and industry; examples include high-dimensional functional magnetic resonance imaging data and three-layer defense systems with a multidimensional structure. The growth in data dimensionality and scale brings new opportunities for data visualization. However, traditional rectangular coordinates can express at most three-dimensional data and are therefore unsuitable for research on high-dimensional data visualization.
There are currently two main classes of high-dimensional visualization techniques. The first class consists of dimensionality reduction methods, which map high-dimensional data to a low-dimensional space and represent the reduced data with scatter points or other symbols. These mainly include principal component analysis, self-organizing maps, and neural-network-based methods. Although dimensionality reduction can, to some extent, overcome the curse of dimensionality in visualization, it may lose potentially important information, which limits the accuracy of high-dimensional data analysis. The second class obtains visualization results without dimensionality reduction, for example scatterplot matrices, parallel coordinates, and heat maps, and can represent the high-dimensional data information in full. However, as data dimensionality and scale increase, the limited screen space causes large numbers of curves or color blocks to interweave in complicated ways, which greatly restricts the effectiveness of the visualization.
Compared with the above methods, radial layout visualization methods, represented by radial coordinate visualization (Radial Visualization, RadViz) and star coordinates (Star Coordinates, SC), have clear advantages in expressing high-dimensional data. Radial layout methods use the radii of a circle to represent the data dimensions and map each individual to a point in a low-dimensional space. They can not only express data of arbitrary dimensionality efficiently in a low-dimensional space, but also project data with similar features to nearby positions, producing a good visual clustering effect. However, RadViz is defined as a nonlinear mapping that does not take the general shape and distribution of the data into account, and SC is itself a linear visualization method. Therefore, when the data have a nonlinear manifold structure, traditional radial layout visualization methods are limited in their ability to capture that structure.
Therefore, how to efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure, has become a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
The object of the present invention is to provide a visual cluster analysis method and system for high-dimensional data, so as to efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
To achieve the above object, the present invention provides the following solutions:
A visual cluster analysis method for high-dimensional data, the method comprising:
normalizing the high-dimensional data as a preprocessing step;
performing dimension expansion on the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-expanded high-dimensional data;
mapping each group of the dimension-expanded high-dimensional data into a circle-like space using a circle-like mapping visualization method, thereby achieving visual clustering of the high-dimensional data.
Optionally, normalizing the high-dimensional data specifically comprises:
normalizing the high-dimensional data according to the formula $\tilde{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, where $F_{km}$ and $\tilde{F}_{km}$ denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension, respectively; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the scale and the dimensionality of the high-dimensional data F, respectively.
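The normalization step, which rescales each dimension by its own maximum and minimum, can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `min_max_normalize` and the handling of constant columns (mapped to 0.0) are assumptions that the text does not specify.

```python
def min_max_normalize(data):
    """Scale every column (dimension) of `data` to the range [0, 1].

    `data` is a list of K rows, each holding M attribute values.
    Constant columns are mapped to 0.0 (an assumption; the text does not
    say how a zero range is handled).
    """
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in data:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            new_row.append((value - low) / span if span else 0.0)
        normalized.append(new_row)
    return normalized
```

Each row keeps its position, so the k-th output row still corresponds to the k-th group of data.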
Optionally, performing dimension expansion on the normalized high-dimensional data by means of the multi-objective genetic algorithm to obtain the dimension-expanded high-dimensional data specifically comprises:
initializing the population of the multi-objective genetic algorithm, the population comprising multiple individuals, each individual representing an expansion state of the high-dimensional data;
constructing multi-objective evaluation indices, which comprise the number of expanded dimensions of the high-dimensional data, the topology preservation index, and the Dunn index;
screening out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal expansion state;
performing dimension expansion on the normalized high-dimensional data according to the optimal expansion state to obtain the dimension-expanded high-dimensional data.
Optionally, constructing the multi-objective evaluation indices specifically comprises:
determining the number of expanded dimensions of the high-dimensional data by counting the number of 1s in the binary code of each individual in the population;
determining the topology preservation index of each individual according to the formula [formula missing], where TP denotes the topology preservation index, K denotes the scale of the high-dimensional data F, and $t_k$ denotes the rank order of the k-th group of data, determined according to the formula [formula missing]; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ denote the y-th nearest neighbor of the k-th data point in the original space and in the mapped space, respectively; and $nn_{kl}$ and $nn_{kt}$ denote the l-th and t-th nearest neighbors of the k-th data point in the mapped space, respectively;
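determining the Dunn index of each individual according to the formula $DI = \min_{1 \le i \le nc} \left\{ \min_{1 \le j \le nc,\ j \ne i} \left\{ \dfrac{\min_{x \in C_i,\ y \in C_j} d(x, y)}{\max_{1 \le k \le nc} \left\{ \max_{x, y \in C_k} d(x, y) \right\}} \right\} \right\}$, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapped points x and y, $C_i$, $C_j$ and $C_k$ denote the clusters of mapped points i, j and k, nc denotes the number of clusters of mapped points, $\min_{x \in C_i,\ y \in C_j} d(x, y)$ denotes the distance between cluster $C_i$ and cluster $C_j$, and $\max_{x, y \in C_k} d(x, y)$ denotes the diameter of cluster $C_k$.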
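In its standard form, the Dunn index divides the smallest distance between points of different clusters by the largest cluster diameter, so larger values indicate compact, well-separated clusters. The sketch below follows that standard definition with Euclidean distance; the function name `dunn_index`, the input layout, and the choice to return infinity when every cluster has zero diameter are assumptions made for illustration.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index of `clusters`, a list of clusters, each a list of 2-D points.

    Needs at least two clusters.
    """
    # Smallest Euclidean distance between two points in different clusters.
    min_between = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest diameter: the greatest distance between two points of one cluster.
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        return math.inf  # assumption: all clusters are single points
    return min_between / max_diameter
```

The double loop is quadratic in the number of points, which is acceptable for a data set of a few hundred groups such as the one used in the example later in the description.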
Optionally, performing dimension expansion on the normalized high-dimensional data according to the optimal expansion state to obtain the dimension-expanded high-dimensional data specifically comprises:
for each dimension of the normalized high-dimensional data, counting the probability with which values fall into each of r equal divisions of the value range [0, 1], thereby determining the probability histogram of each dimension;
partitioning each probability histogram using the affinity propagation clustering algorithm to determine the partition result for each dimension;
performing dimension expansion according to the partition results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, where the number of dimensions into which each dimension is expanded equals the number of clusters found in that dimension's probability distribution histogram, and, in the data obtained by expanding each dimension, one and only one dimension has a data value equal to the data value in the corresponding original dimension.
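The probability histogram of one dimension can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `probability_histogram` is hypothetical, the default r = 20 is borrowed from the Fig. 3 example, and a value of exactly 1.0 is placed in the last division. The subsequent affinity propagation clustering of the histogram is not shown.

```python
def probability_histogram(values, r=20):
    """Fraction of `values` (each in [0, 1]) falling in each of r equal divisions."""
    counts = [0] * r
    for v in values:
        index = min(int(v * r), r - 1)  # a value of exactly 1.0 goes in the last division
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts]
```

The returned list has one probability per division, and the probabilities sum to 1.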
Optionally, mapping each group of the dimension-expanded high-dimensional data into the circle-like space using the circle-like mapping visualization method to achieve visual clustering of the high-dimensional data specifically comprises:
constructing a circle-like space $C_O$, the circle-like space being the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
determining the correlation between the dimensions of each group of the dimension-expanded high-dimensional data according to the formula [formula missing] to obtain a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the scale of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank values are obtained by ranking each group of the dimension-expanded high-dimensional data by attribute value in each dimension, using the integers 1 to M;
determining the Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;
sorting the dimensions of each group of the dimension-expanded high-dimensional data according to the magnitudes of the elements of the Fiedler vector to obtain the sorted high-dimensional data;
determining, according to the formula [formula missing], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector λ denotes the rank vector of the element magnitudes of the Fiedler vector, λ(i) denotes the i-th element of λ, i = 1, 2, ..., N, and N is the dimensionality of the sorted high-dimensional data;
in the circle-like space, for any group of high-dimensional data, determining, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance to the coordinate origin equals the attribute value of the k-th group of data in dimension λ(i), and denoting it a two-dimensional mapped point, so that any individual corresponds to N two-dimensional mapped points;
forming, for each group of data, a polygon from its corresponding set of two-dimensional mapped points, in one-to-one correspondence, and determining the geometric center of the polygon;
using the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and achieving visual clustering of the high-dimensional data.
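The radial mapping of one group of data onto the unit circle can be sketched as follows. Two points are assumptions, since the coordinate formula is not available in this text: the anchor points are placed at evenly spaced angles on the unit circle, and the geometric center is taken as the mean of the polygon vertices. The function names are hypothetical, and the final t-SNE refinement is not shown.

```python
import math


def anchor_points(n):
    """n anchor points on the unit circle (assumed evenly spaced)."""
    return [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]


def polygon_vertices(row, anchors):
    """Place each attribute value on the ray from the origin to its anchor.

    A value v in [0, 1] lands at distance v from the origin.
    """
    return [(v * ax, v * ay) for v, (ax, ay) in zip(row, anchors)]


def geometric_center(vertices):
    """Mean of the vertices (an assumed reading of 'geometric center')."""
    n = len(vertices)
    return (
        sum(x for x, _ in vertices) / n,
        sum(y for _, y in vertices) / n,
    )
```

Because each attribute pulls the center toward its own anchor, groups of data with similar attribute profiles end up with nearby centers.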
A visual cluster analysis system for high-dimensional data, the system comprising:
a preprocessing module, configured to normalize the high-dimensional data as a preprocessing step;
a dimension expansion module, configured to perform dimension expansion on the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-expanded high-dimensional data;
a mapping module, configured to map each group of the dimension-expanded high-dimensional data into a circle-like space using a circle-like mapping visualization method, thereby achieving visual clustering of the high-dimensional data.
Optionally, the dimension expansion module specifically comprises:
an initialization unit, configured to initialize the population of the multi-objective genetic algorithm, the population comprising multiple individuals, each individual representing an expansion state of the high-dimensional data;
an index construction unit, configured to construct multi-objective evaluation indices, which comprise the number of expanded dimensions of the high-dimensional data, the topology preservation index, and the Dunn index;
a screening unit, configured to screen out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal expansion state;
a dimension expansion unit, configured to perform dimension expansion on the normalized high-dimensional data according to the optimal expansion state to obtain the dimension-expanded high-dimensional data.
Optionally, the dimension expansion unit specifically comprises:
a counting subunit, configured to count, for each dimension of the normalized high-dimensional data, the probability with which values fall into each of r equal divisions of the value range [0, 1], thereby determining the probability histogram of each dimension;
a partitioning subunit, configured to partition each probability histogram using the affinity propagation clustering algorithm to determine the partition result for each dimension;
an expansion subunit, configured to perform dimension expansion according to the partition results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, where the number of dimensions into which each dimension is expanded equals the number of clusters found in that dimension's probability distribution histogram, and, in the data obtained by expanding each dimension, one and only one dimension has a data value equal to the data value in the corresponding original dimension.
Optionally, the mapping module specifically comprises:
a circle-like space construction unit, configured to construct a circle-like space $C_O$, the circle-like space being the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
a similarity matrix determination unit, configured to determine the correlation between the dimensions of each group of the dimension-expanded high-dimensional data according to the formula [formula missing] to obtain a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the scale of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank values are obtained by ranking each group of the dimension-expanded high-dimensional data by attribute value in each dimension, using the integers 1 to M;
a Fiedler vector determination unit, configured to determine the Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;
a sorting unit, configured to sort the dimensions of each group of the dimension-expanded high-dimensional data according to the magnitudes of the elements of the Fiedler vector to obtain the sorted high-dimensional data;
a coordinate point determination unit, configured to determine, according to the formula [formula missing], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector λ denotes the rank vector of the element magnitudes of the Fiedler vector, λ(i) denotes the i-th element of λ, i = 1, 2, ..., N, and N is the dimensionality of the sorted high-dimensional data;
a two-dimensional mapped point determination unit, configured to determine, in the circle-like space and for any group of high-dimensional data, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance to the coordinate origin equals the attribute value of the k-th group of data in dimension λ(i), and to denote it a two-dimensional mapped point, so that any individual corresponds to N two-dimensional mapped points;
a geometric center determination unit, configured to form, for each group of data, a polygon from its corresponding set of two-dimensional mapped points, in one-to-one correspondence, and to determine the geometric center of the polygon;
a visual clustering unit, configured to use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and achieving visual clustering of the high-dimensional data.
Compared with the prior art, the present invention has the following technical effects: the present invention normalizes the high-dimensional data; performs dimension expansion on the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-expanded high-dimensional data; and maps each group of the dimension-expanded high-dimensional data into a circle-like space using a circle-like mapping visualization method, achieving visual clustering of the high-dimensional data. The visual cluster analysis method and system provided by the present invention ensure that the visual cluster analysis is scientifically sound and effective, and can therefore more efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the visual cluster analysis method for high-dimensional data according to an embodiment of the present invention;
Fig. 2 is a structural block diagram of the visual cluster analysis system for high-dimensional data according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the probability histograms and partition results of each dimension of the Iris data set when r = 20, according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a visual cluster analysis method and system for high-dimensional data, so as to efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the visual cluster analysis method for high-dimensional data comprises the following steps:
Step 101: normalize the high-dimensional data as a preprocessing step.
The high-dimensional data are normalized according to the formula $\tilde{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, where $F_{km}$ and $\tilde{F}_{km}$ denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension, respectively; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the scale and the dimensionality of the high-dimensional data F, respectively.
Step 102: perform dimension expansion on the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-expanded high-dimensional data. This specifically comprises:
1) Initializing the population of the multi-objective genetic algorithm; the population comprises multiple individuals, and each individual represents an expansion state of the high-dimensional data.
2) Constructing multi-objective evaluation indices, which comprise the number of expanded dimensions of the high-dimensional data, the topology preservation index, and the Dunn index. Specifically:
The number of expanded dimensions of the high-dimensional data is determined by counting the number of 1s in the binary code of each individual in the population.
The topology preservation index of each individual is determined according to the formula [formula missing], where TP denotes the topology preservation index, K denotes the scale of the high-dimensional data F, and $t_k$ denotes the rank order of the k-th group of data, determined according to the formula [formula missing]; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ denote the y-th nearest neighbor of the k-th data point in the original space and in the mapped space, respectively; and $nn_{kl}$ and $nn_{kt}$ denote the l-th and t-th nearest neighbors of the k-th data point in the mapped space, respectively;
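The Dunn index of each individual is determined according to the formula $DI = \min_{1 \le i \le nc} \left\{ \min_{1 \le j \le nc,\ j \ne i} \left\{ \dfrac{\min_{x \in C_i,\ y \in C_j} d(x, y)}{\max_{1 \le k \le nc} \left\{ \max_{x, y \in C_k} d(x, y) \right\}} \right\} \right\}$, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapped points x and y, $C_i$, $C_j$ and $C_k$ denote the clusters of mapped points i, j and k, nc denotes the number of clusters of mapped points, $\min_{x \in C_i,\ y \in C_j} d(x, y)$ denotes the distance between cluster $C_i$ and cluster $C_j$, and $\max_{x, y \in C_k} d(x, y)$ denotes the diameter of cluster $C_k$.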
3) Screening out the optimal individual by means of the multi-objective evaluation indices; the optimal individual represents the optimal expansion state.
4) Performing dimension expansion on the normalized high-dimensional data according to the optimal expansion state to obtain the dimension-expanded high-dimensional data. For each dimension of the normalized high-dimensional data, the probability with which values fall into each of r equal divisions of the value range [0, 1] is counted, which determines the probability histogram of each dimension. Each probability histogram is partitioned using the affinity propagation clustering algorithm to determine the partition result for each dimension. Dimension expansion is then performed according to the partition results and the optimal expansion state to obtain the dimension-expanded high-dimensional data, where the number of dimensions into which each dimension is expanded equals the number of clusters found in that dimension's probability distribution histogram. In the data obtained by expanding each dimension, one and only one dimension has a data value equal to the data value in the corresponding original dimension; that dimension is the one corresponding to the division to which the original data value belongs, and the data values in the remaining dimensions are 0.
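The screening in sub-step 3) compares individuals on three indices at once. A common way to do this in multi-objective genetic algorithms is the Pareto dominance test, sketched below. This is not a full genetic algorithm (selection, crossover, and mutation are omitted), and the preference directions are assumptions: the sketch treats fewer expanded dimensions and larger topology preservation and Dunn values as better. All function names are hypothetical.

```python
def expanded_dimension_count(code):
    """Number of 1s in a binary individual, i.e. how many dimensions it expands."""
    return sum(code)


def dominates(a, b):
    """True if objectives `a` are no worse than `b` everywhere and better somewhere.

    Each argument is a tuple (expanded_dims, tp, dunn). Assumption: fewer
    expanded dimensions and larger TP and Dunn values are preferred.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] >= b[2]
    better = a[0] < b[0] or a[1] > b[1] or a[2] > b[2]
    return no_worse and better


def non_dominated(objectives):
    """Indices of the objective tuples that no other tuple dominates."""
    return [
        i for i, a in enumerate(objectives)
        if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)
    ]
```

The non-dominated set usually contains more than one individual, so a further rule is still needed to pick the single optimal expansion state; the text does not say which rule is used.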
Step 103: map each group of the dimension-expanded high-dimensional data into a circle-like space using a circle-like mapping visualization method, thereby achieving visual clustering of the high-dimensional data. This specifically comprises:
constructing a circle-like space $C_O$, the circle-like space being the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
determining the correlation between the dimensions of each group of the dimension-expanded high-dimensional data according to the formula [formula missing] to obtain a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the scale of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank values are obtained by ranking each group of the dimension-expanded high-dimensional data by attribute value in each dimension, using the integers 1 to M;
determining the Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;
sorting the dimensions of each group of the dimension-expanded high-dimensional data according to the magnitudes of the elements of the Fiedler vector to obtain the sorted high-dimensional data;
determining, according to the formula [formula missing], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector λ denotes the rank vector of the element magnitudes of the Fiedler vector, λ(i) denotes the i-th element of λ, i = 1, 2, ..., N, and N is the dimensionality of the sorted high-dimensional data;
in the circle-like space, for any group of high-dimensional data, determining, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance to the coordinate origin equals the attribute value of the k-th group of data in dimension λ(i), and denoting it a two-dimensional mapped point, so that any individual corresponds to N two-dimensional mapped points;
forming, for each group of data, a polygon from its corresponding set of two-dimensional mapped points, in one-to-one correspondence, and determining the geometric center of the polygon;
using the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and achieving visual clustering of the high-dimensional data.
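The dimension-ordering step can be sketched with a small pure-Python routine. Note the assumptions: the sketch uses the standard definition of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the Laplacian L = D - S), whereas the text above refers to the maximum eigenvalue, so the choice of eigenvalue here is an assumption; the similarity matrix is assumed symmetric with non-negative entries; and the function name `fiedler_order` and the shifted power iteration are illustrative choices, not the patent's procedure.

```python
def fiedler_order(similarity, iterations=500):
    """Order dimensions by their Fiedler-vector components.

    `similarity` is a symmetric N x N matrix with non-negative entries.
    The vector is found by power iteration on (shift * I - L) with the
    constant vector removed, which converges to the eigenvector of the
    second-smallest eigenvalue of L = D - S.
    """
    n = len(similarity)
    degree = [
        sum(w for j, w in enumerate(row) if j != i)
        for i, row in enumerate(similarity)
    ]
    shift = 2.0 * max(degree)  # Gershgorin bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # remove the constant (eigenvalue 0) component
        lx = [
            degree[i] * x[i]
            - sum(similarity[i][j] * x[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        x = [shift * x[i] - lx[i] for i in range(n)]
        norm = sum(v * v for v in x) ** 0.5
        if norm == 0.0:
            break
        x = [v / norm for v in x]
    return sorted(range(n), key=lambda i: x[i])
```

Sorting by these components tends to place strongly similar dimensions next to each other, which is the usual motivation for a spectral ordering. The sign of an eigenvector is arbitrary, so the order may come out reversed.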
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects: the present invention normalizes the high-dimensional data; performs dimension expansion on the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-expanded high-dimensional data; and maps each group of the dimension-expanded high-dimensional data into a circle-like space using a circle-like mapping visualization method, achieving visual clustering of the high-dimensional data. The visual cluster analysis method and system provided by the present invention ensure that the visual cluster analysis is scientifically sound and effective, and can therefore more efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
The visual cluster analysis method proposed in this patent is described below, taking as an example the 4-dimensional Iris data set of scale 150.
Step A: normalize the Iris data set as a preprocessing step, which specifically comprises:
normalizing the Iris data set F according to the formula $\tilde{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, where $F_{km}$ and $\tilde{F}_{km}$ denote the original attribute value and the normalized attribute value of the k-th group of the Iris data set in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the Iris data set in the m-th dimension, respectively; k = 1, 2, ..., 150 and m = 1, 2, 3, 4;
Step B: perform dimension expansion on the normalized Iris data set by means of the NSGA-II multi-objective genetic algorithm to obtain the dimension-expanded Iris data set, which specifically comprises:
initializing the population of the NSGA-II multi-objective genetic algorithm; the population comprises multiple individuals, and each individual represents a binary-coded expansion state of the high-dimensional data whose length is the dimensionality of the Iris data set, 4, where a 1 or a 0 in the binary code indicates that the corresponding dimension of the Iris data set is or is not expanded, respectively;
constructing multi-objective evaluation indices, which comprise the number of expanded dimensions of the Iris data set, the topology preservation index, and the Dunn index;
screening out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal expansion state of the Iris data set;
performing dimension expansion on the normalized Iris data set according to the optimal expansion state to obtain the dimension-expanded Iris data set, which specifically comprises:
for each dimension of the normalized Iris data set, counting the probability with which values fall into each of 20 equal divisions of the value range [0, 1], thereby determining the probability histograms of the 4 dimensions;
partitioning each of the 4 probability histograms using the affinity propagation clustering algorithm to determine the partition results of the 4 dimensions. Partitioning a probability distribution can be regarded as clustering two-dimensional data, the two-dimensional data being the x-axis coordinate (the attribute value) and the y-axis coordinate (the probability) of each dimension's probability distribution histogram. Fig. 3 shows the probability histograms and partition results of the 4 dimensions of the Iris data set; in the figure, the two-dimensional data coordinates are shown as scatter points, and points belonging to the same partition class are connected by polylines of the same type.
Dimension expansion is performed according to the partition results and the optimal expansion state to obtain the dimension-expanded high-dimensional Iris data set, where the number of dimensions into which each of the 4 dimensions is expanded equals the number of clusters found in the corresponding probability distribution histogram. In the data obtained by expanding each dimension, one and only one dimension has a data value equal to the data value in the corresponding original dimension; that dimension is the one corresponding to the division to which the original data value belongs, and the data values in the remaining dimensions are 0. For example, Fig. 3 shows that the first dimension of the Iris data set is divided into 3 parts, containing 6, 7, and 7 data points respectively. That is, the first dimension of the Iris data set is expanded into three new dimensions, split at the positions 0.3 and 0.65 on the value axis. It follows that if three groups of data take the values 0.2, 0.5, and 0.8 in the first dimension of the Iris data set, their values in the new dimensions are [0.2, 0, 0], [0, 0.5, 0], and [0, 0, 0.8], respectively.
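The worked example above can be reproduced with a short Python sketch. The function names `expand_dimension` and `expand_row` are hypothetical, and one detail is an assumption: a value exactly equal to a split position is assigned to the upper interval, which the text does not specify. The binary code follows the convention of Step B, where 1 means the dimension is expanded and 0 means it is kept as is.

```python
from bisect import bisect_right


def expand_dimension(value, cut_points):
    """Spread one normalized value over len(cut_points) + 1 new dimensions.

    The value is kept in the dimension of the interval it falls into;
    all other new dimensions are 0. `cut_points` must be sorted.
    """
    expanded = [0.0] * (len(cut_points) + 1)
    expanded[bisect_right(cut_points, value)] = value
    return expanded


def expand_row(row, cut_points_per_dim, code):
    """Expand the dimensions of `row` whose bit in the binary `code` is 1."""
    out = []
    for value, cuts, bit in zip(row, cut_points_per_dim, code):
        out.extend(expand_dimension(value, cuts) if bit else [value])
    return out
```

Calling `expand_dimension` with the split positions `[0.3, 0.65]` on the values 0.2, 0.5, and 0.8 yields the three vectors given in the example.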
Step C: using class circle mapping method for visualizing respectively by the higher-dimension flag flower data after each group dimension extension
Collection maps to class space of circles, specifically includes:
Class space of circles is constructed, the class space of circles is two-dimensional Cartesian coordinate system using origin as the unit space of circles in the center of circle;
According toFlag flower data set after determining the extension of each group dimension
Correlation between dimension, obtains similar matrix, wherein SijFor the element that the i-th row jth in the similar matrix arranges, K indicates high
The scale of dimension data F, tkiIt is k-th group of data in the mark sequence value of i-th dimension, the mark sequence value is will be described using 1 to N number of integer
The each group of data of flag flower data after dimension extension is carried out the numerical value of mark sequence by its attribute value size in each dimension,
In, N is the dimension of flag flower data set after dimension extension;
Determining the Fiedler vector by solving for the eigenvector corresponding to the largest eigenvalue of the Laplacian matrix of the similarity matrix;
Sorting the dimensions of each group of dimension-extended high-dimensional Iris data according to the sizes of the elements of the Fiedler vector, obtaining the sorted high-dimensional data;
Determining, according to the formula for $V_{\lambda(i)}$, the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional Iris data set on the arc of $C_O$, where the vector λ is the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of λ, i = 1, 2, ..., N, and N is the number of dimensions of the sorted high-dimensional data;
In the circle-like space, for any group of dimension-extended Iris data, determining on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$ the point whose distance from the coordinate origin equals the attribute value of the k-th group of data in dimension λ(i), and recording it as a two-dimensional mapping point, so that any individual group of data corresponds to N two-dimensional mapping points;
Forming, from the set of two-dimensional mapping points of each group of Iris data, a polygon in one-to-one correspondence with that group, and determining the geometric center of the polygon;
Using the t-SNE algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing visual clustering of the Iris data set.
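The geometric part of Step C (anchor points on the unit circle, mapping points along the rays, polygon centers) can be illustrated with the following sketch. It rests on two labeled assumptions that the text does not confirm: the anchors are spaced at equal angles around the circle, and the geometric center is taken as the average of the polygon's vertices. The function names are hypothetical, and the input dimensions are assumed to be already sorted by the Fiedler vector.

```python
import math


def anchor_points(n_dims):
    """One anchor per dimension on the unit circle (assumed: equal angular spacing)."""
    return [
        (math.cos(2 * math.pi * i / n_dims), math.sin(2 * math.pi * i / n_dims))
        for i in range(n_dims)
    ]


def map_record(record, anchors):
    """Map one data group to N points: the i-th point lies on the ray from the
    origin to anchor i, at a distance equal to the i-th attribute value."""
    return [(v * ax, v * ay) for v, (ax, ay) in zip(record, anchors)]


def geometric_center(points):
    """Average of the polygon vertices (assumed definition of the geometric center)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


anchors = anchor_points(4)
polygon = map_record([1.0, 0.0, 0.5, 0.0], anchors)
center = geometric_center(polygon)
print(center)
```

The resulting centers, one per data group, are the two-dimensional points that the t-SNE step would then rearrange; that step is not reproduced here.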
As shown in Fig. 2, the present invention also provides a high-dimensional data visual clustering analysis system, the system comprising:
A preprocessing module 201, configured to normalize the high-dimensional data as preprocessing. The high-dimensional data are normalized according to the normalization formula, in which $F_{km}$ and its normalized counterpart denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension, respectively; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the size and the number of dimensions of the high-dimensional data F, respectively.
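Because the normalization uses only the per-dimension maximum and minimum and targets the value range [0, 1], it is presumably ordinary min-max scaling, (value - min) / (max - min). The sketch below illustrates that reading; the function name `min_max_normalize` and the handling of a constant dimension are assumptions.

```python
def min_max_normalize(data):
    """Scale each column (dimension) of `data` into [0, 1]."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in data:
        normalized.append([
            # Assumption: a constant column (max == min) is mapped to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


sample = [[5.0, 2.0], [6.0, 4.0], [7.0, 3.0]]
print(min_max_normalize(sample))
# [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```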
A dimension extension module 202, configured to perform dimension extension on the normalized high-dimensional data by means of a multi-objective genetic algorithm, obtaining the dimension-extended high-dimensional data.
The dimension extension module 202 specifically includes:
An initialization unit, configured to initialize the population of the multi-objective genetic algorithm; the population includes multiple individuals, and each individual represents an extension mode of the high-dimensional data;
An index construction unit, configured to construct multi-objective evaluation indices; the multi-objective evaluation indices include the extended dimensionality of the high-dimensional data, a topology preservation index and the Dunn index. Specifically, the extended dimensionality of the high-dimensional data is determined by counting the number of 1s in the binary encoding of each individual in the population. The topology preservation index of each individual is determined according to the formula for TP, where TP denotes the topology preservation index, K denotes the size of the high-dimensional data F, and $t_k$ denotes the rank ordering of the k-th group of data, which is determined by its own formula; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ denote the y-th nearest data point of the k-th group data point in the original space and in the mapping space, respectively; and $nn_{kl}$ and $nn_{kt}$ denote the l-th and t-th nearest data points of the k-th group data point in the mapping space, respectively;
The Dunn index of each individual is determined according to the formula for DI, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapping points x and y, $C_i$, $C_j$ and $C_k$ denote the clusters of mapping points i, j and k, nc denotes the number of clusters of the mapping points, and the formula uses the distance between cluster $C_i$ and cluster $C_j$ and the diameter of cluster $C_k$.
A screening unit, configured to screen out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal extension mode;
A dimension extension unit, configured to perform dimension extension on the normalized high-dimensional data according to the optimal extension mode, obtaining the dimension-extended high-dimensional data.
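Of the evaluation indices listed above, the Dunn index has a widely used textbook definition: the smallest distance between points of different clusters divided by the largest cluster diameter. The sketch below assumes that definition, which may differ in detail from the formula of this specification; the function name `dunn_index` and the sample clusters are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index of a list of clusters, each a list of 2-D mapping points.

    Requires at least two clusters.
    """
    # Smallest distance between two points that belong to different clusters.
    min_between = min(
        math.dist(x, y)
        for ci, cj in combinations(clusters, 2)
        for x in ci
        for y in cj
    )
    # Largest diameter: the greatest distance between two points of one cluster.
    max_diameter = max(
        (math.dist(x, y) for ck in clusters for x, y in combinations(ck, 2)),
        default=0.0,
    )
    return min_between / max_diameter if max_diameter > 0 else math.inf


clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
print(dunn_index(clusters))
# 3.0
```

A larger value indicates compact, well-separated clusters, which is why the index is useful as one objective when screening extension modes.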
The dimension extension unit specifically includes:
A statistics subunit, configured to count, for each dimension of the normalized high-dimensional data, the probability of occurrence in each of r equal divisions of the value range [0, 1], thereby determining the probability histogram of each dimension;
A division subunit, configured to divide each probability histogram using the affinity propagation clustering algorithm, thereby determining the division result of each dimension;
An extension subunit, configured to perform dimension extension according to the division results and the optimal extension mode, obtaining the dimension-extended high-dimensional data, where the number of new dimensions into which each dimension is extended equals the number of clusters of that dimension's probability distribution histogram, and after extension exactly one of the new dimensions holds a data value equal to the data value in the corresponding original dimension.
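The probability histogram built by the statistics subunit can be sketched as follows, assuming it is simply the share of normalized values that fall into each of r equal-width bins of [0, 1]. The function name `probability_histogram` and the sample values are hypothetical, and the subsequent affinity propagation clustering of the histogram is not shown.

```python
def probability_histogram(values, r):
    """Share of `values` (assumed to lie in [0, 1]) falling into each of r equal bins."""
    counts = [0] * r
    for v in values:
        # min() keeps the boundary value 1.0 inside the last bin.
        counts[min(int(v * r), r - 1)] += 1
    total = len(values)
    return [c / total for c in counts]


hist = probability_histogram([0.05, 0.1, 0.45, 0.5, 0.95], r=4)
print(hist)
# [0.4, 0.2, 0.2, 0.2]
```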
A mapping module 203, configured to map each group of the dimension-extended high-dimensional data to the circle-like space using the circle-like mapping visualization method, realizing visual clustering of the high-dimensional data.
The mapping module 203 specifically includes:
A similarity matrix determination unit, configured to determine the correlation between the dimensions of each group of dimension-extended high-dimensional data according to the formula for $S_{ij}$, obtaining a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the size of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank value is the number assigned when the integers 1 to M are used to rank each group of the dimension-extended high-dimensional data by the size of its attribute value in each dimension;
A Fiedler vector determination unit, configured to determine the Fiedler vector by solving for the eigenvector corresponding to the largest eigenvalue of the Laplacian matrix of the similarity matrix;
A sorting unit, configured to sort the dimensions of each group of dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector, obtaining the sorted high-dimensional data;
A coordinate point determination unit, configured to determine, according to the formula for $V_{\lambda(i)}$, the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector λ is the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of λ, i = 1, 2, ..., N, and N is the number of dimensions of the sorted high-dimensional data;
A two-dimensional mapping point determination unit, configured to determine, in the circle-like space and for any group of high-dimensional data, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance from the coordinate origin equals the attribute value of the k-th group of data in dimension λ(i), and to record it as a two-dimensional mapping point, so that any individual group of data corresponds to N two-dimensional mapping points;
A geometric center determination unit, configured to form, from the set of two-dimensional mapping points of each group of data, a polygon in one-to-one correspondence with that group, and to determine the geometric center of the polygon;
A visual clustering realization unit, configured to use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing visual clustering of the high-dimensional data.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention; the above description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, for those of ordinary skill in the art, changes may be made to the specific implementation and scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A high-dimensional data visual clustering analysis method, characterized in that the method comprises:
Normalizing the high-dimensional data as preprocessing;
Performing dimension extension on the normalized high-dimensional data by means of a multi-objective genetic algorithm, obtaining dimension-extended high-dimensional data;
Mapping each group of the dimension-extended high-dimensional data to a circle-like space using a circle-like mapping visualization method, realizing visual clustering of the high-dimensional data.
2. The high-dimensional data visual clustering analysis method according to claim 1, characterized in that normalizing the high-dimensional data as preprocessing specifically includes:
Normalizing the high-dimensional data according to the normalization formula, in which $F_{km}$ and its normalized counterpart denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension, respectively; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the size and the number of dimensions of the high-dimensional data F, respectively.
3. The high-dimensional data visual clustering analysis method according to claim 1, characterized in that performing dimension extension on the normalized high-dimensional data by means of the multi-objective genetic algorithm, obtaining the dimension-extended high-dimensional data, specifically includes:
Initializing the population of the multi-objective genetic algorithm; the population includes multiple individuals, and each individual represents an extension mode of the high-dimensional data;
Constructing multi-objective evaluation indices; the multi-objective evaluation indices include the extended dimensionality of the high-dimensional data, a topology preservation index and the Dunn index;
Screening out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal extension mode;
Performing dimension extension on the normalized high-dimensional data according to the optimal extension mode, obtaining the dimension-extended high-dimensional data.
4. The high-dimensional data visual clustering analysis method according to claim 3, characterized in that constructing the multi-objective evaluation indices specifically includes:
Determining the extended dimensionality of the high-dimensional data by counting the number of 1s in the binary encoding of each individual in the population;
Determining the topology preservation index of each individual according to the formula for TP, where TP denotes the topology preservation index, K denotes the size of the high-dimensional data F, and $t_k$ denotes the rank ordering of the k-th group of data, which is determined by its own formula; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ denote the y-th nearest data point of the k-th group data point in the original space and in the mapping space, respectively; and $nn_{kl}$ and $nn_{kt}$ denote the l-th and t-th nearest data points of the k-th group data point in the mapping space, respectively;
Determining the Dunn index of each individual according to the formula for DI, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapping points x and y, $C_i$, $C_j$ and $C_k$ denote the clusters of mapping points i, j and k, nc denotes the number of clusters of the mapping points, and the formula uses the distance between cluster $C_i$ and cluster $C_j$ and the diameter of cluster $C_k$.
5. The high-dimensional data visual clustering analysis method according to claim 3, characterized in that performing dimension extension on the normalized high-dimensional data according to the optimal extension mode, obtaining the dimension-extended high-dimensional data, specifically includes:
Counting, for each dimension of the normalized high-dimensional data, the probability of occurrence in each of r equal divisions of the value range [0, 1], thereby determining the probability histogram of each dimension;
Dividing each probability histogram using the affinity propagation clustering algorithm, thereby determining the division result of each dimension;
Performing dimension extension according to the division results and the optimal extension mode, obtaining the dimension-extended high-dimensional data, where the number of new dimensions into which each dimension is extended equals the number of clusters of that dimension's probability distribution histogram, and after extension exactly one of the new dimensions holds a data value equal to the data value in the corresponding original dimension.
6. The high-dimensional data visual clustering analysis method according to claim 1, characterized in that mapping each group of the dimension-extended high-dimensional data to the circle-like space using the circle-like mapping visualization method, realizing visual clustering of the high-dimensional data, specifically includes:
Constructing the circle-like space $C_O$, which is the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
Determining the correlation between the dimensions of each group of dimension-extended high-dimensional data according to the formula for $S_{ij}$, obtaining a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the size of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank value is the number assigned when the integers 1 to M are used to rank each group of the dimension-extended high-dimensional data by the size of its attribute value in each dimension;
Determining the Fiedler vector by solving for the eigenvector corresponding to the largest eigenvalue of the Laplacian matrix of the similarity matrix;
Sorting the dimensions of each group of dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector, obtaining the sorted high-dimensional data;
Determining, according to the formula for $V_{\lambda(i)}$, the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector λ is the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of λ, i = 1, 2, ..., N, and N is the number of dimensions of the sorted high-dimensional data;
In the circle-like space, for any group of high-dimensional data, determining on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$ the point whose distance from the coordinate origin equals the attribute value of the k-th group of data in dimension λ(i), and recording it as a two-dimensional mapping point, so that any individual group of data corresponds to N two-dimensional mapping points;
Forming, from the set of two-dimensional mapping points of each group of data, a polygon in one-to-one correspondence with that group, and determining the geometric center of the polygon;
Using the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing visual clustering of the high-dimensional data.
7. A high-dimensional data visual clustering analysis system, characterized in that the system comprises:
A preprocessing module, configured to normalize the high-dimensional data as preprocessing;
A dimension extension module, configured to perform dimension extension on the normalized high-dimensional data by means of a multi-objective genetic algorithm, obtaining dimension-extended high-dimensional data;
A mapping module, configured to map each group of the dimension-extended high-dimensional data to a circle-like space using a circle-like mapping visualization method, realizing visual clustering of the high-dimensional data.
8. The high-dimensional data visual clustering analysis system according to claim 7, characterized in that the dimension extension module specifically includes:
An initialization unit, configured to initialize the population of the multi-objective genetic algorithm; the population includes multiple individuals, and each individual represents an extension mode of the high-dimensional data;
An index construction unit, configured to construct multi-objective evaluation indices; the multi-objective evaluation indices include the extended dimensionality of the high-dimensional data, a topology preservation index and the Dunn index;
A screening unit, configured to screen out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal extension mode;
A dimension extension unit, configured to perform dimension extension on the normalized high-dimensional data according to the optimal extension mode, obtaining the dimension-extended high-dimensional data.
9. The high-dimensional data visual clustering analysis system according to claim 8, characterized in that the dimension extension unit specifically includes:
A statistics subunit, configured to count, for each dimension of the normalized high-dimensional data, the probability of occurrence in each of r equal divisions of the value range [0, 1], thereby determining the probability histogram of each dimension;
A division subunit, configured to divide each probability histogram using the affinity propagation clustering algorithm, thereby determining the division result of each dimension;
An extension subunit, configured to perform dimension extension according to the division results and the optimal extension mode, obtaining the dimension-extended high-dimensional data, where the number of new dimensions into which each dimension is extended equals the number of clusters of that dimension's probability distribution histogram, and after extension exactly one of the new dimensions holds a data value equal to the data value in the corresponding original dimension.
10. The high-dimensional data visual clustering analysis system according to claim 7, characterized in that the mapping module specifically includes:
A circle-like space construction unit, configured to construct the circle-like space $C_O$, which is the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
A similarity matrix determination unit, configured to determine the correlation between the dimensions of each group of dimension-extended high-dimensional data according to the formula for $S_{ij}$, obtaining a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the size of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank value is the number assigned when the integers 1 to M are used to rank each group of the dimension-extended high-dimensional data by the size of its attribute value in each dimension;
A Fiedler vector determination unit, configured to determine the Fiedler vector by solving for the eigenvector corresponding to the largest eigenvalue of the Laplacian matrix of the similarity matrix;
A sorting unit, configured to sort the dimensions of each group of dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector, obtaining the sorted high-dimensional data;
A coordinate point determination unit, configured to determine, according to the formula for $V_{\lambda(i)}$, the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector λ is the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of λ, i = 1, 2, ..., N, and N is the number of dimensions of the sorted high-dimensional data;
A two-dimensional mapping point determination unit, configured to determine, in the circle-like space and for any group of high-dimensional data, on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance from the coordinate origin equals the attribute value of the k-th group of data in dimension λ(i), and to record it as a two-dimensional mapping point, so that any individual group of data corresponds to N two-dimensional mapping points;
A geometric center determination unit, configured to form, from the set of two-dimensional mapping points of each group of data, a polygon in one-to-one correspondence with that group, and to determine the geometric center of the polygon;
A visual clustering realization unit, configured to use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing visual clustering of the high-dimensional data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811517242.2A CN109271441B (en) | 2018-12-12 | 2018-12-12 | High-dimensional data visual clustering analysis method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109271441A true CN109271441A (en) | 2019-01-25 |
CN109271441B CN109271441B (en) | 2020-09-01 |
Family
ID=65187645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811517242.2A Active CN109271441B (en) | 2018-12-12 | 2018-12-12 | High-dimensional data visual clustering analysis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109271441B (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108764676A (en) * | 2018-05-17 | 2018-11-06 | 南昌航空大学 | A kind of higher-dimension multi-objective assessment method and system |
Non-Patent Citations (3)
Title |
---|
ANDREAS KÖNIG: "Interactive Visualization and Analysis of Hierarchical Neural Projections for Data Mining", 《IEEE TRANSACTIONS ON NEURAL NETWORKS》 * |
JOHN F. MCCARTHY ET AL.: "Applications of Machine Learning and High‐Dimensional Visualization in Cancer Detection, Diagnosis, and Management", 《ANNALS OF THE NEW YORK ACADEMY OF SCIENCES》 * |
ZHOU Fangfang et al.: "Radviz visual clustering analysis method based on dimension extension", 《Journal of Software》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162568A (en) * | 2019-05-24 | 2019-08-23 | 东北大学 | A kind of three-dimensional data method for visualizing based on PCA-Radviz |
CN110308873A (en) * | 2019-06-24 | 2019-10-08 | 浙江大华技术股份有限公司 | A kind of date storage method, device, equipment and medium |
CN110458187A (en) * | 2019-06-27 | 2019-11-15 | 广州大学 | A kind of malicious code family clustering method and system |
CN110781569A (en) * | 2019-11-08 | 2020-02-11 | 桂林电子科技大学 | Multi-resolution grid division based anomaly detection method and system |
CN110781569B (en) * | 2019-11-08 | 2023-12-19 | 桂林电子科技大学 | Abnormality detection method and system based on multi-resolution grid division |
CN113095427A (en) * | 2021-04-23 | 2021-07-09 | 中南大学 | High-dimensional data analysis method and face data analysis method based on user guidance |
Also Published As
Publication number | Publication date |
---|---|
CN109271441B (en) | 2020-09-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||