CN109271441A - Visual cluster analysis method and system for high-dimensional data - Google Patents


Publication number: CN109271441A
Authority: CN (China)
Prior art keywords: dimension; dimensional data; high dimensional; data; extension
Legal status: Granted, Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201811517242.2A
Other languages: Chinese (zh)
Other versions: CN109271441B (en)
Inventors: 黎明, 黄珊, 陈昊, 陈震, 李军华, 张聪炫
Current assignee: Nanchang Hangkong University (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original assignee: Nanchang Hangkong University
Application filed by: Nanchang Hangkong University
Priority to: CN201811517242.2A (patent/CN109271441B/en)
Publication of CN109271441A: patent/CN109271441A/en
Application granted; publication of CN109271441B: patent/CN109271441B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual cluster analysis method and system for high-dimensional data. The method comprises: normalizing the high-dimensional data as a preprocessing step; extending the dimensions of the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-extended high-dimensional data; and mapping each group of the dimension-extended high-dimensional data into a quasi-circular space with a quasi-circular mapping visualization method, thereby achieving visual clustering of the high-dimensional data. The method or system can efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.

Description

Visual cluster analysis method and system for high-dimensional data
Technical field
The present invention relates to the field of visual clustering of high-dimensional data, and in particular to a visual cluster analysis method and system for high-dimensional data.
Background art
Visualization is an important data analysis tool. It mainly uses methods such as computer graphics, image processing, and signal processing to express the internal structure, information, and knowledge of data, which benefits research such as pattern recognition and outlier detection. With the rapid development of computers and sensing equipment, multidimensional and even high-dimensional data have become widespread in fields such as economics, medicine, the military, and industry, for example high-dimensional functional magnetic resonance imaging data and multidimensional three-layer defense systems. The growth in data dimension and scale brings new opportunities for data visualization. However, traditional rectangular coordinates can express at most three-dimensional data and are not suitable for high-dimensional data visualization research.
At present there are two main classes of high-dimensional visualization techniques. The first is dimensionality reduction, which maps high-dimensional data to a low-dimensional space and represents the reduced data with scatter points or other symbols; it mainly includes principal component analysis, self-organizing maps, and the NeuroScale method. Although dimensionality-reduction visualization can, in a sense, overcome the curse of dimensionality in visualization, it may lose potentially important information and so limits the accuracy of high-dimensional data analysis. The second class obtains the visualization without dimensionality reduction, for example scatterplot matrices, parallel coordinates, and heat maps, and can represent the high-dimensional data information completely. However, as data dimension and scale increase, the limited screen causes a large number of curves or color blocks to become intricately interwoven, which greatly restricts the effectiveness of the visualization.
Compared with the above methods, radial layout visualization methods, represented by radial coordinate visualization (Radial Visualization, RadViz) and star coordinates (Star Coordinates, SC), have clear advantages in expressing high-dimensional data. Radial layout visualization methods use the radii of a circle to represent the data dimensions and map each individual to a point in a low-dimensional space. They can not only express data of arbitrary dimension efficiently in a low-dimensional space, but also project data with similar features to nearby positions, producing a good visual clustering effect. However, RadViz is defined as a nonlinear mapping that does not consider the general shape and distribution of the data, and SC is itself a linear visualization method. Therefore, when the data have a nonlinear manifold structure, traditional radial layout visualization methods are limited in capturing the nonlinear data structure.
Therefore, how to efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure, has become a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
The object of the present invention is to provide a visual cluster analysis method and system for high-dimensional data, so as to efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
To achieve the above object, the present invention provides the following schemes:
A visual cluster analysis method for high-dimensional data, the method comprising:
Normalizing the high-dimensional data as a preprocessing step;
Extending the dimensions of the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-extended high-dimensional data;
Mapping each group of the dimension-extended high-dimensional data into a quasi-circular space with a quasi-circular mapping visualization method, thereby achieving visual clustering of the high-dimensional data.
Optionally, normalizing the high-dimensional data specifically includes:
Normalizing the high-dimensional data according to the formula $\bar{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, where $F_{km}$ and $\bar{F}_{km}$ denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension, respectively; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the scale and the dimension of the high-dimensional data F, respectively.
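The normalization step can be illustrated with a minimal pure-Python sketch of column-wise min-max scaling. The function name `min_max_normalize`, the sample values, and the handling of a constant dimension (mapped to 0.0) are assumptions made for illustration; the source does not specify them.

```python
def min_max_normalize(data):
    """Column-wise min-max scaling of a K x M list of rows into [0, 1]."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in data:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant dimension (span == 0) is mapped to 0.0.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical data: K = 3 records, M = 2 dimensions.
sample = [[5.0, 2.0], [6.0, 4.0], [7.0, 3.0]]
result = min_max_normalize(sample)
```

After scaling, every dimension spans exactly [0, 1], which is what the later histogram step over the value range [0, 1] relies on.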
Optionally, extending the dimensions of the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain the dimension-extended high-dimensional data specifically includes:
Initializing the population of the multi-objective genetic algorithm; the population includes multiple individuals; each individual represents an extension state of the high-dimensional data;
Constructing multi-objective evaluation indices; the multi-objective evaluation indices include the number of extended dimensions of the high-dimensional data, the topology preservation index, and the Dunn index;
Screening out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal extension state;
Extending the dimensions of the normalized high-dimensional data according to the optimal extension state to obtain the dimension-extended high-dimensional data.
Optionally, constructing the multi-objective evaluation indices specifically includes:
Determining the number of extended dimensions of the high-dimensional data by counting the number of 1s in the binary code of each individual in the population;
Determining the topology preservation index of each individual according to the formula [formula omitted in source], where TP denotes the topology preservation index, K denotes the scale of the high-dimensional data F, and $t_k$ denotes the rank value of the k-th group of data, determined according to the formula [formula omitted in source]; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ denote the y-th nearest data point of the k-th data point in the original space and in the mapping space, respectively; $nn_{kl}$ and $nn_{kt}$ denote the l-th and t-th nearest data points of the k-th data point in the mapping space, respectively;
Determining the Dunn index of each individual according to the formula $DI = \min_{1 \le i \le nc} \min_{j \ne i} \dfrac{d(C_i, C_j)}{\max_{1 \le k \le nc} \operatorname{diam}(C_k)}$, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapping points x and y, $C_i$, $C_j$, and $C_k$ denote the clusters of the mapping points with indices i, j, and k, nc denotes the number of clusters of the mapping points, $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ denotes the distance between cluster $C_i$ and cluster $C_j$, and $\operatorname{diam}(C_k) = \max_{x, y \in C_k} d(x, y)$ denotes the diameter of cluster $C_k$.
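As an illustration of the Dunn index used as one of the evaluation indices, the following sketch assumes the standard definition: the smallest Euclidean distance between points of different clusters divided by the largest cluster diameter. The function name `dunn_index` and the sample clusters are hypothetical, and the sketch assumes at least two clusters, one of which has two or more points.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """clusters: list of clusters, each a list of (x, y) mapping points."""
    # Smallest distance between two points that belong to different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points of the same cluster (the diameter).
    max_diameter = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_separation / max_diameter


# Hypothetical mapping points: two clusters of diameter 1, three units apart.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
score = dunn_index(clusters)
```

A larger value indicates compact, well-separated clusters, which is why the index is a natural objective to maximize when screening extension states.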
Optionally, extending the dimensions of the normalized high-dimensional data according to the optimal extension state to obtain the dimension-extended high-dimensional data specifically includes:
Counting, for each dimension of the normalized high-dimensional data, the probability of values occurring in each of r equal divisions of the value range [0, 1], and determining the probability histogram of each dimension;
Partitioning each probability histogram using the affinity propagation clustering algorithm to determine the partition result of each dimension;
Extending the dimensions according to the partition results and the optimal extension state to obtain the dimension-extended high-dimensional data, where the number of dimensions produced by extending a dimension equals the number of clusters found in that dimension's probability distribution histogram, and in the extended data of each dimension exactly one new dimension holds a value equal to the data value in the corresponding original dimension.
Optionally, mapping each group of the dimension-extended high-dimensional data into the quasi-circular space with the quasi-circular mapping visualization method to achieve visual clustering of the high-dimensional data specifically includes:
Constructing the quasi-circular space $C_O$, which is the unit-circle space of the two-dimensional Cartesian coordinate system centered at the origin;
Determining the correlation between the dimensions of each group of dimension-extended high-dimensional data according to [formula omitted in source] to obtain a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the scale of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank value is the number obtained by ranking, with the integers 1 to M, each group of the dimension-extended high-dimensional data according to the size of its attribute value in each dimension;
Determining the Fiedler vector by solving for the eigenvector corresponding to the largest eigenvalue of the Laplacian matrix of the similarity matrix;
Sorting the dimensions of each group of dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector to obtain the sorted high-dimensional data;
Determining, according to the formula [formula omitted in source], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector $\lambda$ denotes the rank vector of the element sizes of the Fiedler vector, $\lambda(i)$ denotes the i-th element of $\lambda$, i = 1, 2, ..., N, and N is the dimension of the sorted high-dimensional data;
In the quasi-circular space, for any group of high-dimensional data, on the straight line joining the coordinate origin and the coordinate point $V_{\lambda(i)}$, determining the point whose distance to the coordinate origin equals the attribute value of the k-th group of data in dimension $\lambda(i)$, and recording it as a two-dimensional mapping point; any individual thus corresponds to N two-dimensional mapping points;
Forming one polygon for each group of data from its corresponding set of two-dimensional mapping points, and determining the geometric center of the polygon;
Using the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the within-cluster spacing of the polygon geometric centers and increase their between-cluster spacing, determining the positions of the mapping points, and achieving visual clustering of the high-dimensional data.
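The dimension-ordering step can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the patented procedure: it takes a ready-made symmetric similarity matrix (the similarity formula is not reproduced in the source), builds the unnormalized graph Laplacian, and uses the standard definition of the Fiedler vector, namely the eigenvector of the second-smallest Laplacian eigenvalue, whereas the text above names the largest eigenvalue. The function name `fiedler_order` and the sample matrix are hypothetical.

```python
import numpy as np


def fiedler_order(similarity):
    """Order dimensions by the entries of the Fiedler vector.

    Assumes a symmetric similarity matrix describing a connected graph.
    """
    s = np.asarray(similarity, dtype=float)
    laplacian = np.diag(s.sum(axis=1)) - s  # unnormalized graph Laplacian
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]  # second-smallest eigenvalue (assumption)
    return np.argsort(fiedler), fiedler


# Hypothetical similarity: dimensions linked in the chain 0 - 2 - 1 - 3.
similarity = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
order, vector = fiedler_order(similarity)
```

Because an eigenvector is only defined up to sign, the recovered order may be the chain or its reverse; both place similar dimensions next to each other on the circle.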
A visual cluster analysis system for high-dimensional data, the system comprising:
A preprocessing module, for normalizing the high-dimensional data as a preprocessing step;
A dimension extension module, for extending the dimensions of the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-extended high-dimensional data;
A mapping module, for mapping each group of the dimension-extended high-dimensional data into a quasi-circular space with a quasi-circular mapping visualization method, thereby achieving visual clustering of the high-dimensional data.
Optionally, the dimension extension module specifically includes:
An initialization unit, for initializing the population of the multi-objective genetic algorithm; the population includes multiple individuals; each individual represents an extension state of the high-dimensional data;
An index construction unit, for constructing multi-objective evaluation indices; the multi-objective evaluation indices include the number of extended dimensions of the high-dimensional data, the topology preservation index, and the Dunn index;
A screening unit, for screening out the optimal individual by means of the multi-objective evaluation indices, the optimal individual representing the optimal extension state;
A dimension extension unit, for extending the dimensions of the normalized high-dimensional data according to the optimal extension state to obtain the dimension-extended high-dimensional data.
Optionally, the dimension extension unit specifically includes:
A counting subunit, for counting, for each dimension of the normalized high-dimensional data, the probability of values occurring in each of r equal divisions of the value range [0, 1], and determining the probability histogram of each dimension;
A partitioning subunit, for partitioning each probability histogram using the affinity propagation clustering algorithm to determine the partition result of each dimension;
An extension subunit, for extending the dimensions according to the partition results and the optimal extension state to obtain the dimension-extended high-dimensional data, where the number of dimensions produced by extending a dimension equals the number of clusters found in that dimension's probability distribution histogram, and in the extended data of each dimension exactly one new dimension holds a value equal to the data value in the corresponding original dimension.
Optionally, the mapping module specifically includes:
A quasi-circular space construction unit, for constructing the quasi-circular space $C_O$, which is the unit-circle space of the two-dimensional Cartesian coordinate system centered at the origin;
A similarity matrix determination unit, for determining the correlation between the dimensions of each group of dimension-extended high-dimensional data according to [formula omitted in source] to obtain a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the scale of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank value is the number obtained by ranking, with the integers 1 to M, each group of the dimension-extended high-dimensional data according to the size of its attribute value in each dimension;
A Fiedler vector determination unit, for determining the Fiedler vector by solving for the eigenvector corresponding to the largest eigenvalue of the Laplacian matrix of the similarity matrix;
A sorting unit, for sorting the dimensions of each group of dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector to obtain the sorted high-dimensional data;
A coordinate point determination unit, for determining, according to the formula [formula omitted in source], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector $\lambda$ denotes the rank vector of the element sizes of the Fiedler vector, $\lambda(i)$ denotes the i-th element of $\lambda$, i = 1, 2, ..., N, and N is the dimension of the sorted high-dimensional data;
A two-dimensional mapping point determination unit, for determining, in the quasi-circular space and for any group of high-dimensional data, on the straight line joining the coordinate origin and the coordinate point $V_{\lambda(i)}$, the point whose distance to the coordinate origin equals the attribute value of the k-th group of data in dimension $\lambda(i)$, and recording it as a two-dimensional mapping point; any individual thus corresponds to N two-dimensional mapping points;
A geometric center determination unit, for forming one polygon for each group of data from its corresponding set of two-dimensional mapping points, and determining the geometric center of the polygon;
A visual clustering unit, for using the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the within-cluster spacing of the polygon geometric centers and increase their between-cluster spacing, determining the positions of the mapping points, and achieving visual clustering of the high-dimensional data.
Compared with the prior art, the present invention has the following technical effects. The invention normalizes the high-dimensional data as a preprocessing step, extends the dimensions of the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-extended high-dimensional data, and maps each group of the dimension-extended high-dimensional data into a quasi-circular space with a quasi-circular mapping visualization method, thereby achieving visual clustering of the high-dimensional data. The visual cluster analysis method and system provided by the invention can ensure that the visual cluster analysis is scientifically sound and effective, and can therefore more efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the visual cluster analysis method for high-dimensional data according to an embodiment of the present invention;
Fig. 2 is a structural block diagram of the visual cluster analysis system for high-dimensional data according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the probability histograms and partition results of each dimension of the iris data set when r = 20, according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a visual cluster analysis method and system for high-dimensional data, so as to efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the visual cluster analysis method for high-dimensional data includes the following steps:
Step 101: normalize the high-dimensional data as a preprocessing step.
Normalize the high-dimensional data according to the formula $\bar{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, where $F_{km}$ and $\bar{F}_{km}$ denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension, respectively; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the scale and the dimension of the high-dimensional data F, respectively.
Step 102: extend the dimensions of the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain the dimension-extended high-dimensional data. This specifically includes:
1) Initialize the population of the multi-objective genetic algorithm; the population includes multiple individuals; each individual represents an extension state of the high-dimensional data.
2) Construct the multi-objective evaluation indices; the multi-objective evaluation indices include the number of extended dimensions of the high-dimensional data, the topology preservation index, and the Dunn index. Specifically:
Determine the number of extended dimensions of the high-dimensional data by counting the number of 1s in the binary code of each individual in the population.
Determine the topology preservation index of each individual according to the formula [formula omitted in source], where TP denotes the topology preservation index, K denotes the scale of the high-dimensional data F, and $t_k$ denotes the rank value of the k-th group of data, determined according to the formula [formula omitted in source]; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ denote the y-th nearest data point of the k-th data point in the original space and in the mapping space, respectively; $nn_{kl}$ and $nn_{kt}$ denote the l-th and t-th nearest data points of the k-th data point in the mapping space, respectively;
Determine the Dunn index of each individual according to the formula $DI = \min_{1 \le i \le nc} \min_{j \ne i} \dfrac{d(C_i, C_j)}{\max_{1 \le k \le nc} \operatorname{diam}(C_k)}$, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapping points x and y, $C_i$, $C_j$, and $C_k$ denote the clusters of the mapping points with indices i, j, and k, nc denotes the number of clusters of the mapping points, $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ denotes the distance between cluster $C_i$ and cluster $C_j$, and $\operatorname{diam}(C_k) = \max_{x, y \in C_k} d(x, y)$ denotes the diameter of cluster $C_k$.
3) Screen out the optimal individual by means of the multi-objective evaluation indices; the optimal individual represents the optimal extension state.
4) Extend the dimensions of the normalized high-dimensional data according to the optimal extension state to obtain the dimension-extended high-dimensional data. Count, for each dimension of the normalized high-dimensional data, the probability of values occurring in each of r equal divisions of the value range [0, 1], and determine the probability histogram of each dimension; partition each probability histogram using the affinity propagation clustering algorithm to determine the partition result of each dimension; extend the dimensions according to the partition results and the optimal extension state to obtain the dimension-extended high-dimensional data, where the number of dimensions produced by extending a dimension equals the number of clusters found in that dimension's probability distribution histogram, and in the extended data of each dimension exactly one new dimension holds a value equal to the data value in the corresponding original dimension; that new dimension is the one corresponding to the division to which the original data value belongs, and the data values in the remaining new dimensions are 0.
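The histogram part of step 4) can be sketched in pure Python as follows. The function name `probability_histogram` and the sample values are hypothetical, and the treatment of the upper edge (a value of exactly 1.0 is counted in the last division) is an assumption. The subsequent partitioning with affinity propagation is not shown.

```python
def probability_histogram(values, r=20):
    """Share of values falling in each of r equal divisions of [0, 1]."""
    counts = [0] * r
    for v in values:
        # Clamp the top edge so that v == 1.0 lands in the last division.
        index = min(int(v * r), r - 1)
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts]


# Hypothetical normalized values of one dimension, with r = 4 for brevity.
hist = probability_histogram([0.0, 0.1, 0.12, 0.5, 1.0], r=4)
```

Each dimension yields one such histogram; pairing every division with its probability gives the two-dimensional points that the clustering step then groups into partitions.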
Step 103: map each group of the dimension-extended high-dimensional data into the quasi-circular space with the quasi-circular mapping visualization method to achieve visual clustering of the high-dimensional data. This specifically includes:
Construct the quasi-circular space $C_O$, which is the unit-circle space of the two-dimensional Cartesian coordinate system centered at the origin;
Determine the correlation between the dimensions of each group of dimension-extended high-dimensional data according to [formula omitted in source] to obtain a similarity matrix, where $S_{ij}$ is the element in row i and column j of the similarity matrix, K denotes the scale of the high-dimensional data F, and $t_{ki}$ is the rank value of the k-th group of data in the i-th dimension; the rank value is the number obtained by ranking, with the integers 1 to M, each group of the dimension-extended high-dimensional data according to the size of its attribute value in each dimension;
Determine the Fiedler vector by solving for the eigenvector corresponding to the largest eigenvalue of the Laplacian matrix of the similarity matrix;
Sort the dimensions of each group of dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector to obtain the sorted high-dimensional data;
Determine, according to the formula [formula omitted in source], the coordinate point $V_{\lambda(i)}$ of each dimension of the sorted high-dimensional data on the arc of $C_O$, where the vector $\lambda$ denotes the rank vector of the element sizes of the Fiedler vector, $\lambda(i)$ denotes the i-th element of $\lambda$, i = 1, 2, ..., N, and N is the dimension of the sorted high-dimensional data;
In the quasi-circular space, for any group of high-dimensional data, on the straight line joining the coordinate origin and the coordinate point $V_{\lambda(i)}$, determine the point whose distance to the coordinate origin equals the attribute value of the k-th group of data in dimension $\lambda(i)$, and record it as a two-dimensional mapping point; any individual thus corresponds to N two-dimensional mapping points;
Form one polygon for each group of data from its corresponding set of two-dimensional mapping points, and determine the geometric center of the polygon;
Use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the within-cluster spacing of the polygon geometric centers and increase their between-cluster spacing, determine the positions of the mapping points, and achieve visual clustering of the high-dimensional data.
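The geometric part of step 103 (anchor points on the unit circle, mapping points along each ray, and the polygon center) can be sketched as follows. Two choices are assumptions, since the source does not reproduce the coordinate formula: the anchors are placed at equal angular spacing in the sorted dimension order, and the geometric center is taken as the mean of the polygon's vertices. All function names and sample values are hypothetical.

```python
import math


def anchor_points(n_dims):
    """Assumed layout: n_dims anchors evenly spaced on the unit circle."""
    return [
        (math.cos(2 * math.pi * i / n_dims), math.sin(2 * math.pi * i / n_dims))
        for i in range(n_dims)
    ]


def map_record(record, anchors):
    """One point per dimension, at distance `value` along that anchor's ray."""
    return [(value * ax, value * ay) for value, (ax, ay) in zip(record, anchors)]


def polygon_center(points):
    """Assumed geometric center: the mean of the polygon's vertices."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# Hypothetical record with N = 4 dimensions, already in sorted dimension order.
anchors = anchor_points(4)
vertices = map_record([1.0, 0.5, 0.0, 0.5], anchors)
center = polygon_center(vertices)
```

Records with similar attribute profiles produce similar polygons and therefore nearby centers, which is the property the final t-SNE refinement builds on.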
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects. The invention normalizes the high-dimensional data as a preprocessing step, extends the dimensions of the normalized high-dimensional data by means of a multi-objective genetic algorithm to obtain dimension-extended high-dimensional data, and maps each group of the dimension-extended high-dimensional data into a quasi-circular space with a quasi-circular mapping visualization method, thereby achieving visual clustering of the high-dimensional data. The visual cluster analysis method and system provided by the invention can ensure that the visual cluster analysis is scientifically sound and effective, and can therefore more efficiently achieve visual clustering of high-dimensional data, in particular high-dimensional data containing nonlinear structure.
The visual cluster analysis method proposed in this patent is described below, taking the 4-dimensional iris data set of size 150 as an example.
Step A: normalize the iris data set as a preprocessing step, which specifically includes:
Normalize the iris data set F according to the formula $\bar{F}_{km} = \dfrac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$, where $F_{km}$ and $\bar{F}_{km}$ denote the original attribute value and the normalized attribute value of the k-th group of the iris data set in the m-th dimension, respectively; $\max(F_m)$ and $\min(F_m)$ denote the maximum and minimum attribute values of the iris data set in the m-th dimension, respectively; k = 1, 2, ..., 150 and m = 1, 2, 3, 4;
Step B: extend the dimensions of the normalized iris data set by means of the NSGA-II multi-objective genetic algorithm to obtain the dimension-extended iris data set, which specifically includes:
Initialize the population of the NSGA-II multi-objective genetic algorithm; the population includes multiple individuals; each individual is a binary code representing an extension state of the high-dimensional data, with length 4, equal to the number of dimensions of the iris data set, where 1 and 0 in the binary code indicate that the corresponding dimension of the iris data set is or is not extended, respectively;
Construct the multi-objective evaluation indices; the multi-objective evaluation indices include the number of extended dimensions of the iris data set, the topology preservation index, and the Dunn index;
Screen out the optimal individual by means of the multi-objective evaluation indices; the optimal individual represents the optimal extension state of the iris data set;
Extend the dimensions of the normalized iris data set according to the optimal extension state to obtain the dimension-extended iris data set, which specifically includes:
Count, for each dimension of the normalized iris data set, the probability of values occurring in each of 20 equal divisions of the value range [0, 1], and determine the probability histograms of the 4 dimensions;
Partition each of the 4 probability histograms using the affinity propagation clustering algorithm to determine the partition results of the 4 dimensions. Partitioning the probability distribution can be regarded as clustering two-dimensional data whose coordinates are the x-axis (the value) and the y-axis (the probability) of each dimension's probability distribution histogram. Fig. 3 shows the probability histograms and partition results of the 4 dimensions of the iris data set; in the figure, the two-dimensional data coordinates are drawn as scatter points, and the points of the same partition are connected by a polyline of the same type.
Extend the dimensions according to the partition results and the optimal extension state to obtain the dimension-extended high-dimensional iris data set, where the number of dimensions produced by extending each of the 4 dimensions equals the number of clusters found in the corresponding probability distribution histogram, and in the extended data of each dimension exactly one new dimension holds a value equal to the data value in the corresponding original dimension; that new dimension is the one corresponding to the division to which the original data value belongs, and the data values in the remaining new dimensions are 0. For example, Fig. 3 shows that the first dimension of the iris data set is divided into 3 parts containing 6, 7, and 7 data points, respectively. That is, the first dimension of the iris data set is extended into three new dimensions, with the splits located at 0.3 and 0.65. It follows that if 3 groups of data have the values 0.2, 0.5, and 0.8 in the first dimension of the iris data set, their values in the new dimensions are [0.2 0 0], [0 0.5 0], and [0 0 0.8], respectively.
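The worked example for the first iris dimension can be reproduced with a short sketch. It assumes the split points 0.3 and 0.65 quoted above act as boundaries on the normalized value; the function name `extend_value` is hypothetical, and the rule that a value exactly on a boundary goes to the upper part is an assumption.

```python
from bisect import bisect_right


def extend_value(value, boundaries):
    """Spread one normalized value over len(boundaries) + 1 new dimensions."""
    extended = [0.0] * (len(boundaries) + 1)
    # bisect_right returns the index of the part the value falls into.
    extended[bisect_right(boundaries, value)] = value
    return extended


boundaries = [0.3, 0.65]  # split points quoted in the example above
rows = [extend_value(v, boundaries) for v in (0.2, 0.5, 0.8)]
```

Each original value keeps its magnitude but moves to the new dimension of its own part, so records from different parts of the distribution no longer share an axis.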
Step C: mapping each group of the dimension-extended high-dimensional Iris data set to the class-circle space using the class-circle mapping visualization method, which specifically includes:
Constructing the class-circle space, which is the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
Determining, according to the similarity formula, the correlation between the dimensions of the dimension-extended Iris data set to obtain a similarity matrix, where S_ij is the element in row i and column j of the similarity matrix, K denotes the size of the high-dimensional data F, and t_ki is the rank value of the k-th group of data in the i-th dimension; the rank value is the number obtained by ranking each group of the dimension-extended Iris data by its attribute value in each dimension, using the integers 1 to N, where N is the dimension of the dimension-extended Iris data set;
Determining the Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplace matrix of the similarity matrix;
Sorting the dimensions of the dimension-extended high-dimensional Iris data set according to the sizes of the elements of the Fiedler vector, to obtain the sorted high-dimensional data;
Determining, according to the coordinate formula, the coordinate point V_λ(i) of each dimension of the sorted high-dimensional Iris data set on the arc of C_O, where the vector λ denotes the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of the vector λ, i = 1, 2, ..., N, and N is the dimension of the sorted high-dimensional data;
In the class-circle space, for any group of the dimension-extended Iris data, determining, on the straight line connecting the coordinate origin and the coordinate point V_λ(i), the point whose distance to the coordinate origin is given by the attribute value of the k-th group of data in dimension λ(i), and denoting it a two-dimensional mapped point; each individual datum corresponds to N two-dimensional mapped points;
Forming, from the set of two-dimensional mapped points corresponding to each group of Iris data, a polygon in one-to-one correspondence with that group, and determining the geometric center of the polygon;
Using the t-SNE algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing the visual clustering of the Iris data set.
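The geometric part of Step C (anchor points on the unit circle, mapped points along the rays, polygon center) can be sketched as follows. The coordinate formula is not reproduced in this text, so the even angular spacing used in `anchor_points` is an assumption, as are the function names and the use of the vertex mean as the geometric center. The similarity matrix, the Fiedler ordering and the t-SNE refinement are left out.

```python
import math


def anchor_points(order):
    """Place one anchor per dimension on the unit circle.

    `order[i]` is the rank (1..N) of dimension i. Even angular spacing is an
    assumption made for this sketch, not the formula from the text.
    """
    n = len(order)
    return [(math.cos(2 * math.pi * (rank - 1) / n),
             math.sin(2 * math.pi * (rank - 1) / n)) for rank in order]


def map_record(record, anchors):
    """Return the polygon vertices and their mean for one data record.

    Each attribute value in [0, 1] is used as the distance from the origin
    along the ray towards that dimension's anchor.
    """
    vertices = [(value * ax, value * ay)
                for value, (ax, ay) in zip(record, anchors)]
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    return vertices, (cx, cy)


anchors = anchor_points([1, 2, 3, 4])
vertices, center = map_record([0.2, 0.0, 0.0, 0.8], anchors)
print(center)
```

Because a dimension-extended record is mostly zeros, many of its polygon vertices sit at the origin, and the center is pulled towards the few anchors whose dimensions are non-zero.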
As shown in Fig. 2, the present invention also provides a high-dimensional data visual clustering analysis system, the system comprising:
Preprocessing module 201, configured to perform normalization preprocessing on the high-dimensional data. The high-dimensional data are normalized according to the normalization formula, where F_km and its normalized counterpart denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension; max(F_m) and min(F_m) denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the size and the dimension of the high-dimensional data F.
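The normalization formula itself is not reproduced in this text. Since it is described in terms of max(F_m) and min(F_m), and the normalized values are later binned over [0, 1], the sketch below assumes ordinary min-max scaling per dimension. The function name and the treatment of constant columns are assumptions of the sketch.

```python
def min_max_normalize(data):
    """Scale every column of `data` (a list of rows) to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in data:
        normalized.append([
            # A constant column has max == min; map it to 0.0 to avoid
            # dividing by zero (an assumption of this sketch).
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


data = [[5.0, 10.0], [7.0, 10.0], [9.0, 10.0]]
print(min_max_normalize(data))
```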
Dimension extension module 202, configured to perform dimension extension on the normalized high-dimensional data by a multi-objective genetic algorithm, to obtain the dimension-extended high-dimensional data.
The dimension extension module 202 specifically includes:
Initialization unit, configured to initialize the population of the multi-objective genetic algorithm; the population includes multiple individuals; each individual represents an extension mode of the high-dimensional data;
Index construction unit, configured to construct the multi-objective evaluation indexes; the multi-objective evaluation indexes include the extension dimension of the high-dimensional data, the topology preservation index and the Dunn index. Specifically, the extension dimension of the high-dimensional data is determined by counting the number of 1s in the binary encoding of each individual in the population. The topology preservation index of each individual is determined according to the topology preservation formula, where TP denotes the topology preservation index, K denotes the size of the high-dimensional data F, and t_k denotes the rank ordering of the k-th group of data, which is determined according to a further formula; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; NN_ky and nn_ky denote the y-th nearest data point of the k-th group data point in the original space and in the mapping space respectively; nn_kl and nn_kt denote the l-th and the t-th nearest data points of the k-th group data point in the mapping space;
The Dunn index of each individual is determined according to the Dunn index formula, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapped points x and y, C_i, C_j and C_k denote the clusters of mapped points i, j and k, and nc denotes the number of clusters of the mapped points; the formula uses the distance between cluster C_i and cluster C_j and the diameter of cluster C_k.
Screening unit, configured to select the optimal individual by the multi-objective evaluation indexes, the optimal individual representing the optimal extension mode;
Dimension extension unit, configured to perform dimension extension on the normalized high-dimensional data according to the optimal extension mode, to obtain the dimension-extended high-dimensional data.
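One of the evaluation indexes used by the index construction unit, the Dunn index, can be illustrated with a short sketch. The exact formula is not reproduced in this text; the code assumes the common definition, in which the distance between two clusters is the smallest point-to-point distance and the diameter of a cluster is its largest internal distance. The function name and the sample clusters are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of 2-D points.

    Assumes the common definition: smallest distance between points of
    different clusters divided by the largest cluster diameter. Requires at
    least two clusters and at least one cluster with two or more points.
    """
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_separation / max_diameter


clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
print(dunn_index(clusters))
```

A larger value means compact clusters that lie far apart, which is why the index is suitable as an objective for selecting an extension mode.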
The dimension extension unit specifically includes:
Counting subunit, configured to count, for each dimension of the normalized high-dimensional data, the probability of occurrence at r equal divisions of the value range [0, 1], and to determine the probability histogram of each dimension;
Partitioning subunit, configured to partition each probability histogram using the affinity propagation clustering algorithm, and to determine the partition result of each dimension;
Extension subunit, configured to perform dimension extension according to the partition results and the optimal extension mode, to obtain the dimension-extended high-dimensional data, where the number of dimensions into which each dimension is extended equals the number of clusters of that dimension's probability distribution histogram, and, after each dimension is extended, a datum has exactly one new dimension whose value equals its value in the corresponding original dimension.
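The work of the counting subunit can be sketched as follows. The sketch interprets the "r equal divisions" as r equal-width bins on [0, 1], which is an assumption, as are the function name and the rule that places the value 1.0 in the last bin.

```python
def probability_histogram(values, r=20):
    """Fraction of values falling into each of r equal-width bins on [0, 1]."""
    counts = [0] * r
    for value in values:
        # Clamp so that the value 1.0 lands in the last bin instead of bin r.
        index = min(int(value * r), r - 1)
        counts[index] += 1
    total = len(values)
    return [count / total for count in counts]


hist = probability_histogram([0.0, 0.1, 0.1, 0.5, 1.0], r=4)
print(hist)
```

The resulting pairs (bin position, probability) are the two-dimensional points that the partitioning subunit would then cluster.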
Mapping module 203, configured to map each group of the dimension-extended high-dimensional data to the class-circle space using the class-circle mapping visualization method, to realize the visual clustering of the high-dimensional data.
The mapping module 203 specifically includes:
Similarity matrix determination unit, configured to determine, according to the similarity formula, the correlation between the dimensions of the dimension-extended high-dimensional data to obtain a similarity matrix, where S_ij is the element in row i and column j of the similarity matrix, K denotes the size of the high-dimensional data F, and t_ki is the rank value of the k-th group of data in the i-th dimension; the rank value is the number obtained by ranking each group of the dimension-extended high-dimensional data by its attribute value in each dimension, using the integers 1 to M;
Fiedler vector determination unit, configured to determine the Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplace matrix of the similarity matrix;
Sorting unit, configured to sort the dimensions of the dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector, to obtain the sorted high-dimensional data;
Coordinate point determination unit, configured to determine, according to the coordinate formula, the coordinate point V_λ(i) of each dimension of the sorted high-dimensional data on the arc of C_O, where the vector λ denotes the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of the vector λ, i = 1, 2, ..., N, and N is the dimension of the sorted high-dimensional data;
Two-dimensional mapped point determination unit, configured to determine, in the class-circle space and for any group of high-dimensional data, on the straight line connecting the coordinate origin and the coordinate point V_λ(i), the point whose distance to the coordinate origin is given by the attribute value of the k-th group of data in dimension λ(i), and to denote it a two-dimensional mapped point; each individual datum corresponds to N two-dimensional mapped points;
Geometric center determination unit, configured to form, from the set of two-dimensional mapped points corresponding to each group of data, a polygon in one-to-one correspondence with that group, and to determine the geometric center of the polygon;
Visual clustering realization unit, configured to use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing the visual clustering of the high-dimensional data.
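The dimension ordering performed by the Fiedler vector determination unit and the sorting unit can be sketched with plain power iteration. The sketch follows the wording of the text literally (eigenvector of the maximum eigenvalue of the Laplace matrix) and assumes the unnormalized Laplacian L = D - S; both function names and the sample similarity matrix are hypothetical.

```python
def dominant_eigenvector(matrix, iterations=500):
    """Power iteration: eigenvector of the largest-magnitude eigenvalue."""
    n = len(matrix)
    # A non-uniform start avoids the constant vector, which the Laplacian
    # maps to zero.
    vector = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n))
                   for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    return vector


def order_dimensions(similarity):
    """Sort dimension indices by the entries of the Laplacian's dominant eigenvector."""
    n = len(similarity)
    # Graph Laplacian L = D - S, with D the diagonal matrix of row sums of S.
    laplacian = [
        [(sum(similarity[i]) if i == j else 0.0) - similarity[i][j]
         for j in range(n)]
        for i in range(n)
    ]
    vector = dominant_eigenvector(laplacian)
    return sorted(range(n), key=lambda i: vector[i])


similarity = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
order = order_dimensions(similarity)
print(order)
```

Note that the term Fiedler vector conventionally refers to the eigenvector of the second-smallest Laplacian eigenvalue. An implementation following that convention would select a different eigenvector and can produce a different ordering than this literal reading.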
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementation and the scope of application may vary according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A high-dimensional data visual clustering analysis method, characterized in that the method comprises:
Performing normalization preprocessing on high-dimensional data;
Performing dimension extension on the normalized high-dimensional data by a multi-objective genetic algorithm, to obtain dimension-extended high-dimensional data;
Mapping each group of the dimension-extended high-dimensional data to a class-circle space using a class-circle mapping visualization method, to realize the visual clustering of the high-dimensional data.
2. The high-dimensional data visual clustering analysis method according to claim 1, characterized in that performing normalization preprocessing on the high-dimensional data specifically includes:
Normalizing the high-dimensional data according to the normalization formula, where F_km and its normalized counterpart denote the original attribute value and the normalized attribute value of the k-th group of high-dimensional data in the m-th dimension; max(F_m) and min(F_m) denote the maximum and minimum attribute values of the high-dimensional data F in the m-th dimension; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M denote the size and the dimension of the high-dimensional data F.
3. The high-dimensional data visual clustering analysis method according to claim 1, characterized in that performing dimension extension on the normalized high-dimensional data by the multi-objective genetic algorithm, to obtain the dimension-extended high-dimensional data, specifically includes:
Initializing the population of the multi-objective genetic algorithm; the population includes multiple individuals; each individual represents an extension mode of the high-dimensional data;
Constructing the multi-objective evaluation indexes; the multi-objective evaluation indexes include the extension dimension of the high-dimensional data, the topology preservation index and the Dunn index;
Selecting the optimal individual by the multi-objective evaluation indexes, the optimal individual representing the optimal extension mode;
Performing dimension extension on the normalized high-dimensional data according to the optimal extension mode, to obtain the dimension-extended high-dimensional data.
4. The high-dimensional data visual clustering analysis method according to claim 3, characterized in that constructing the multi-objective evaluation indexes specifically includes:
Determining the extension dimension of the high-dimensional data by counting the number of 1s in the binary encoding of each individual in the population;
Determining the topology preservation index of each individual according to the topology preservation formula, where TP denotes the topology preservation index, K denotes the size of the high-dimensional data F, and t_k denotes the rank ordering of the k-th group of data, which is determined according to a further formula; u and s denote numbers of nearest-neighbor data points, usually u = 4 and s = 10; NN_ky and nn_ky denote the y-th nearest data point of the k-th group data point in the original space and in the mapping space respectively; nn_kl and nn_kt denote the l-th and the t-th nearest data points of the k-th group data point in the mapping space;
Determining the Dunn index of each individual according to the Dunn index formula, where DI denotes the Dunn index, d(x, y) denotes the Euclidean distance between mapped points x and y, C_i, C_j and C_k denote the clusters of mapped points i, j and k, and nc denotes the number of clusters of the mapped points; the formula uses the distance between cluster C_i and cluster C_j and the diameter of cluster C_k.
5. The high-dimensional data visual clustering analysis method according to claim 3, characterized in that performing dimension extension on the normalized high-dimensional data according to the optimal extension mode, to obtain the dimension-extended high-dimensional data, specifically includes:
Counting, for each dimension of the normalized high-dimensional data, the probability of occurrence at r equal divisions of the value range [0, 1], and determining the probability histogram of each dimension;
Partitioning each probability histogram using the affinity propagation clustering algorithm, and determining the partition result of each dimension;
Performing dimension extension according to the partition results and the optimal extension mode, to obtain the dimension-extended high-dimensional data, where the number of dimensions into which each dimension is extended equals the number of clusters of that dimension's probability distribution histogram, and, after each dimension is extended, a datum has exactly one new dimension whose value equals its value in the corresponding original dimension.
6. The high-dimensional data visual clustering analysis method according to claim 1, characterized in that mapping each group of the dimension-extended high-dimensional data to the class-circle space using the class-circle mapping visualization method, to realize the visual clustering of the high-dimensional data, specifically includes:
Constructing the class-circle space C_O, which is the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
Determining, according to the similarity formula, the correlation between the dimensions of the dimension-extended high-dimensional data to obtain a similarity matrix, where S_ij is the element in row i and column j of the similarity matrix, K denotes the size of the high-dimensional data F, and t_ki is the rank value of the k-th group of data in the i-th dimension; the rank value is the number obtained by ranking each group of the dimension-extended high-dimensional data by its attribute value in each dimension, using the integers 1 to M;
Determining the Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplace matrix of the similarity matrix;
Sorting the dimensions of the dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector, to obtain the sorted high-dimensional data;
Determining, according to the coordinate formula, the coordinate point V_λ(i) of each dimension of the sorted high-dimensional data on the arc of C_O, where the vector λ denotes the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of the vector λ, i = 1, 2, ..., N, and N is the dimension of the sorted high-dimensional data;
In the class-circle space, for any group of high-dimensional data, determining, on the straight line connecting the coordinate origin and the coordinate point V_λ(i), the point whose distance to the coordinate origin is given by the attribute value of the k-th group of data in dimension λ(i), and denoting it a two-dimensional mapped point; each individual datum corresponds to N two-dimensional mapped points;
Forming, from the set of two-dimensional mapped points corresponding to each group of data, a polygon in one-to-one correspondence with that group, and determining the geometric center of the polygon;
Using the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing the visual clustering of the high-dimensional data.
7. A high-dimensional data visual clustering analysis system, characterized in that the system comprises:
A preprocessing module, configured to perform normalization preprocessing on high-dimensional data;
A dimension extension module, configured to perform dimension extension on the normalized high-dimensional data by a multi-objective genetic algorithm, to obtain dimension-extended high-dimensional data;
A mapping module, configured to map each group of the dimension-extended high-dimensional data to a class-circle space using a class-circle mapping visualization method, to realize the visual clustering of the high-dimensional data.
8. The high-dimensional data visual clustering analysis system according to claim 7, characterized in that the dimension extension module specifically includes:
An initialization unit, configured to initialize the population of the multi-objective genetic algorithm; the population includes multiple individuals; each individual represents an extension mode of the high-dimensional data;
An index construction unit, configured to construct the multi-objective evaluation indexes; the multi-objective evaluation indexes include the extension dimension of the high-dimensional data, the topology preservation index and the Dunn index;
A screening unit, configured to select the optimal individual by the multi-objective evaluation indexes, the optimal individual representing the optimal extension mode;
A dimension extension unit, configured to perform dimension extension on the normalized high-dimensional data according to the optimal extension mode, to obtain the dimension-extended high-dimensional data.
9. The high-dimensional data visual clustering analysis system according to claim 8, characterized in that the dimension extension unit specifically includes:
A counting subunit, configured to count, for each dimension of the normalized high-dimensional data, the probability of occurrence at r equal divisions of the value range [0, 1], and to determine the probability histogram of each dimension;
A partitioning subunit, configured to partition each probability histogram using the affinity propagation clustering algorithm, and to determine the partition result of each dimension;
An extension subunit, configured to perform dimension extension according to the partition results and the optimal extension mode, to obtain the dimension-extended high-dimensional data, where the number of dimensions into which each dimension is extended equals the number of clusters of that dimension's probability distribution histogram, and, after each dimension is extended, a datum has exactly one new dimension whose value equals its value in the corresponding original dimension.
10. The high-dimensional data visual clustering analysis system according to claim 7, characterized in that the mapping module specifically includes:
A class-circle space construction unit, configured to construct the class-circle space C_O, which is the unit circle centered at the origin of a two-dimensional Cartesian coordinate system;
A similarity matrix determination unit, configured to determine, according to the similarity formula, the correlation between the dimensions of the dimension-extended high-dimensional data to obtain a similarity matrix, where S_ij is the element in row i and column j of the similarity matrix, K denotes the size of the high-dimensional data F, and t_ki is the rank value of the k-th group of data in the i-th dimension; the rank value is the number obtained by ranking each group of the dimension-extended high-dimensional data by its attribute value in each dimension, using the integers 1 to M;
A Fiedler vector determination unit, configured to determine the Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplace matrix of the similarity matrix;
A sorting unit, configured to sort the dimensions of the dimension-extended high-dimensional data according to the sizes of the elements of the Fiedler vector, to obtain the sorted high-dimensional data;
A coordinate point determination unit, configured to determine, according to the coordinate formula, the coordinate point V_λ(i) of each dimension of the sorted high-dimensional data on the arc of C_O, where the vector λ denotes the rank vector of the element sizes of the Fiedler vector, λ(i) denotes the i-th element of the vector λ, i = 1, 2, ..., N, and N is the dimension of the sorted high-dimensional data;
A two-dimensional mapped point determination unit, configured to determine, in the class-circle space and for any group of high-dimensional data, on the straight line connecting the coordinate origin and the coordinate point V_λ(i), the point whose distance to the coordinate origin is given by the attribute value of the k-th group of data in dimension λ(i), and to denote it a two-dimensional mapped point; each individual datum corresponds to N two-dimensional mapped points;
A geometric center determination unit, configured to form, from the set of two-dimensional mapped points corresponding to each group of data, a polygon in one-to-one correspondence with that group, and to determine the geometric center of the polygon;
A visual clustering realization unit, configured to use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the intra-cluster distances between the polygon geometric centers and increase the inter-cluster distances between them, thereby determining the positions of the mapped points and realizing the visual clustering of the high-dimensional data.
CN201811517242.2A 2018-12-12 2018-12-12 High-dimensional data visual clustering analysis method and system Active CN109271441B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811517242.2A CN109271441B (en) 2018-12-12 2018-12-12 High-dimensional data visual clustering analysis method and system

Publications (2)

Publication Number Publication Date
CN109271441A (en) 2019-01-25
CN109271441B (en) 2020-09-01

Family

ID=65187645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811517242.2A Active CN109271441B (en) 2018-12-12 2018-12-12 High-dimensional data visual clustering analysis method and system

Country Status (1)

Country Link
CN (1) CN109271441B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764676A (en) * 2018-05-17 2018-11-06 南昌航空大学 A kind of higher-dimension multi-objective assessment method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANDREAS KÖNIG: "Interactive Visualization and Analysis of Hierarchical Neural Projections for Data Mining", 《IEEE TRANSACTIONS ON NEURAL NETWORKS》 *
JOHN F. MCCARTHY ET AL.: "Applications of Machine Learning and High‐Dimensional Visualization in Cancer Detection, Diagnosis, and Management", 《ANNALS OF THE NEW YORK ACADEMY OF SCIENCES》 *
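ZHOU Fangfang et al.: "Radviz visual clustering analysis method based on dimension extension" (基于维度扩展的Radviz可视化聚类分析方法), 《Journal of Software》 (软件学报) *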

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162568A (en) * 2019-05-24 2019-08-23 东北大学 A kind of three-dimensional data method for visualizing based on PCA-Radviz
CN110308873A (en) * 2019-06-24 2019-10-08 浙江大华技术股份有限公司 A kind of date storage method, device, equipment and medium
CN110458187A (en) * 2019-06-27 2019-11-15 广州大学 A kind of malicious code family clustering method and system
CN110781569A (en) * 2019-11-08 2020-02-11 桂林电子科技大学 Multi-resolution grid division based anomaly detection method and system
CN110781569B (en) * 2019-11-08 2023-12-19 桂林电子科技大学 Abnormality detection method and system based on multi-resolution grid division
CN113095427A (en) * 2021-04-23 2021-07-09 中南大学 High-dimensional data analysis method and face data analysis method based on user guidance

Also Published As

Publication number Publication date
CN109271441B (en) 2020-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant