CN109271441B

CN109271441B - High-dimensional data visual clustering analysis method and system

Info

Publication number: CN109271441B
Application number: CN201811517242.2A
Authority: CN
Inventors: 黎明; 黄珊; 陈昊; 陈震; 李军华; 张聪炫
Original assignee: Nanchang Hangkong University
Current assignee: Nanchang Hangkong University
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2020-09-01
Anticipated expiration: 2038-12-12
Also published as: CN109271441A

Abstract

The invention discloses a high-dimensional data visual clustering analysis method and system. The method comprises the following steps: carrying out normalization preprocessing on the high-dimensional data; performing dimension expansion on the high-dimensional data subjected to the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data subjected to the dimension expansion; and mapping each group of the high-dimensional data after the dimensionality expansion to a circle-like space by using a circle-like mapping visualization method to realize the visualization clustering of the high-dimensional data. The method or the system can effectively realize the visual clustering of high-dimensional data, particularly the high-dimensional data containing the nonlinear structure.

Description

High-dimensional data visual clustering analysis method and system

Technical Field

The invention relates to the field of high-dimensional data visual clustering, in particular to a high-dimensional data visual clustering analysis method and system.

Background

Visualization is an important data analysis tool. It expresses the internal structure, information and knowledge of data mainly through computer graphics, image processing, signal processing and similar methods, which benefits research such as pattern recognition and outlier detection. With the rapid development of computers and sensing equipment, multi-dimensional and even high-dimensional data have become widespread in fields such as economics, medicine, the military and industry, for example high-dimensional functional magnetic resonance imaging data and three-layer defense systems with multi-dimensional structures. The growth in data dimension and scale brings new opportunities for data visualization. However, traditional rectangular coordinates can express at most three-dimensional data and are not suitable for visualizing high-dimensional data.

At present, there are two main types of high-dimensional visualization technology. The first is the dimension reduction method, which maps high-dimensional data to a low-dimensional space and represents the reduced data by scatter points or other symbols. It mainly comprises principal component analysis, self-organizing maps, the neuron measurement method and the like. Although dimension reduction can overcome the curse of dimensionality in visualization to a certain extent, it can lose potentially important information, which limits the accuracy of high-dimensional data analysis. The second type obtains visualization results without dimension reduction, for example scatter plot matrices, parallel coordinates and heat maps, which can represent high-dimensional data information intact. However, as the dimension and scale of the data increase, a large number of curves or color blocks become intricately interlaced because of the limited screen area, which greatly restricts the effectiveness of the visualization.

Compared with the above methods, radial layout visualization methods, represented by Radial Visualization (RadViz) and Star Coordinates (SC), have a significant advantage in expressing high-dimensional data. A radial layout visualization method characterizes the data dimensions by the radii of a circle and maps each individual to a point in a low-dimensional space. Such a method can not only efficiently express data of any dimension in a low-dimensional space, but also project data with similar characteristics to nearby positions, thereby forming a good visual clustering effect. However, RadViz is defined as a general nonlinear mapping that does not take into account the shape and distribution of the data, and SC is itself a linear visualization method. Therefore, when the data have a nonlinear manifold structure, traditional radial layout visualization methods are limited in capturing that structure.

Therefore, how to efficiently realize the visualized clustering of the high-dimensional data, especially the high-dimensional data containing the nonlinear structure, is a technical problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a high-dimensional data visual clustering analysis method and system, which are used for efficiently realizing visual clustering of high-dimensional data, particularly high-dimensional data containing nonlinear structures.

In order to achieve the purpose, the invention provides the following scheme:

a method of high dimensional data visualization cluster analysis, the method comprising:

carrying out normalization preprocessing on the high-dimensional data;

performing dimension expansion on the high-dimensional data subjected to the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data subjected to the dimension expansion;

and mapping each group of the high-dimensional data after the dimensionality expansion to a circle-like space by using a circle-like mapping visualization method to realize the visualization clustering of the high-dimensional data.

Optionally, the performing normalization preprocessing on the high-dimensional data specifically includes:

according to the formula

$$\hat{F}_{km} = \frac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$$

normalizing the high-dimensional data, wherein $F_{km}$ and $\hat{F}_{km}$ respectively represent the original attribute value and the normalized attribute value of the kth group of high-dimensional data on the mth dimension; $\max(F_m)$ and $\min(F_m)$ respectively represent the maximum attribute value and the minimum attribute value of the high-dimensional data F on the mth dimension; k = 1, 2, ..., K and m = 1, 2, ..., M, where K and M represent the scale and the dimension of the high-dimensional data F, respectively.
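The normalization step can be illustrated with a minimal sketch, assuming plain min-max scaling of each dimension to [0, 1]. The function name `min_max_normalize`, the sample values, and the handling of a constant column are illustrative assumptions, not taken from the patent.

```python
def min_max_normalize(data):
    """Scale every column of `data` (K rows, M values per row) to [0, 1]."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in data:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column has no spread; mapping it to 0.0 is an assumption.
            new_row.append((value - low) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Hypothetical toy data: 3 groups of data in 2 dimensions.
sample = [[5.0, 2.0], [6.0, 4.0], [7.0, 3.0]]
scaled = min_max_normalize(sample)
print(scaled)
```

After this step every dimension shares the value range [0, 1], which the later histogram step relies on.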

Optionally, the performing, by the multi-target genetic algorithm, dimension expansion on the high-dimensional data after the normalization processing to obtain the high-dimensional data after the dimension expansion specifically includes:

initializing a population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data;

constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data;

screening out an optimal individual through the multi-target evaluation index, wherein the optimal individual represents an optimal expansion state;

and performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion.

Optionally, the constructing a multi-target evaluation index specifically includes:

determining the expansion dimension of the high-dimensional data by counting the number of 1s in the binary code of each individual in the population;

according to the formula

determining a topology maintenance index for each of the individuals, wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and $t_k$ represents the rank ordering of the kth group of data, which is determined according to the formula

wherein u and s both represent numbers of nearest-neighbor data points, typically u = 4 and s = 10; $NN_{ky}$ and $nn_{ky}$ represent the yth nearest data point of the kth group of data points in the original space and in the mapped space, respectively; and $nn_{kl}$ and $nn_{kt}$ represent the lth and tth nearest data points of the kth group of data points in the mapping space, respectively;

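according to the formula

$$DI = \min_{1 \le i \le nc} \; \min_{\substack{1 \le j \le nc \\ j \ne i}} \left\{ \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{1 \le k \le nc} \max_{x, y \in C_k} d(x, y)} \right\}$$

determining the Dunn index for each of the individuals, wherein DI represents the Dunn index; d(x, y) represents the Euclidean distance between mapping points x and y; $C_i$, $C_j$ and $C_k$ represent the ith, jth and kth clusters of mapping points; nc represents the number of clusters of mapping points; $\min_{x \in C_i,\, y \in C_j} d(x, y)$ represents the distance between cluster $C_i$ and cluster $C_j$; and $\max_{x, y \in C_k} d(x, y)$ represents the diameter of cluster $C_k$.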
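The topology maintenance formula itself is not reproduced in the text above. The following is a sketch only, assuming a König-style neighbourhood-rank score (3 points when the yth neighbour is preserved at the same rank, 2 when it stays among the u nearest, 1 when it falls between ranks u+1 and s, 0 otherwise) normalized by 3·u·K. The scoring weights, the normalization, and the function names are assumptions and may differ from the patented formula.

```python
import math


def nearest_neighbors(points, k, count):
    """Indices of the `count` nearest neighbours of point k, nearest first."""
    others = [j for j in range(len(points)) if j != k]
    others.sort(key=lambda j: math.dist(points[k], points[j]))
    return others[:count]


def topology_preservation(original, mapped, u=4, s=10):
    """Assumed Koenig-style score in [0, 1]; 1 means neighbour ranks are kept."""
    total = 0
    for k in range(len(original)):
        nn_orig = nearest_neighbors(original, k, u)   # NN_ky, original space
        nn_map = nearest_neighbors(mapped, k, s)      # nn_ky, mapped space
        for y, neighbor in enumerate(nn_orig):
            if y < len(nn_map) and nn_map[y] == neighbor:
                total += 3      # same neighbour at the same rank
            elif neighbor in nn_map[:u]:
                total += 2      # still among the u nearest, at another rank
            elif neighbor in nn_map[u:s]:
                total += 1      # pushed out to ranks u+1 .. s
    return total / (3 * u * len(original))


# Hypothetical 1-D points with pairwise distinct distances.
line = [(0.0,), (1.0,), (3.0,), (7.0,), (15.0,), (31.0,)]
tp_identity = topology_preservation(line, line)
print(tp_identity)
```

A mapping that leaves every neighbourhood unchanged scores 1, and the score drops as neighbours change rank or leave the s-neighbourhood.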

Optionally, performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion, specifically including:

dividing the value range [0, 1] of each dimension of the normalized high-dimensional data into r equal intervals, counting the probability with which the data fall in each interval, and determining a probability histogram of each dimension;

dividing each probability histogram by using an affinity propagation clustering algorithm, and determining the division result of each dimension;

and performing dimension expansion according to the division result and the optimal expansion state to obtain the dimension-expanded high-dimensional data, wherein the number of dimensions into which each dimension is expanded is equal to the number of clusters of the probability histogram of that dimension, and each expanded datum has one and only one new dimension whose value is equal to the data value on the corresponding original dimension.
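As a minimal sketch of the histogram step, the fraction of normalized values falling in each of r equal intervals of [0, 1] can be computed as below. The function name and the choice to place the value 1.0 in the last interval are assumptions; the subsequent clustering of the histogram is not shown.

```python
def probability_histogram(values, r):
    """Fraction of normalized values falling in each of r equal bins on [0, 1]."""
    counts = [0] * r
    for v in values:
        # The value 1.0 is placed in the last bin rather than an (r+1)-th one.
        index = min(int(v * r), r - 1)
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts]


# Hypothetical normalized values of one dimension, with r = 4.
hist = probability_histogram([0.0, 0.1, 0.3, 0.6, 1.0], r=4)
print(hist)
```

Each bin centre together with its probability gives one two-dimensional point, which is the form in which the histogram is later clustered.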

Optionally, the mapping each group of the dimensionality-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method to implement the visual clustering of the high-dimensional data specifically includes:

constructing a circle-like space $C_O$, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system, centered at the origin;

according to

determining the correlation among the dimensions of each group of dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the ith row and the jth column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ordering value of the kth group of data in the ith dimension, the ordering values being the integers from 1 to M obtained by ranking the groups of dimension-expanded high-dimensional data by the size of their attribute values in each dimension;

determining a Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;

sorting the dimensionalities of the high-dimensional data after each group of dimensionality expansion according to the sizes of elements in the Fiedler vector to obtain sorted high-dimensional data;

according to the formula

determining, for each dimension of the sorted high-dimensional data, a coordinate point $V_{\lambda(i)}$ on the arc of $C_O$, wherein the vector λ represents the rank-order vector of the element sizes of the Fiedler vector, λ(i) represents the ith element of the vector λ, and i = 1, 2, ..., N;

in the circle-like space, for any group of high-dimensional data, recording as a two-dimensional mapping point the point that lies on the straight line connecting the coordinate origin and the coordinate point $V_{\lambda(i)}$ and whose distance to the coordinate origin is the attribute value of the kth group of data in the λ(i)th dimension, so that any individual corresponds to N two-dimensional mapping points;

forming one-to-one corresponding polygons through the two-dimensional space point sets corresponding to the groups of data, and determining the geometric center of the polygons;

and reducing the intra-cluster spacing of the geometric centers of the polygons and increasing their inter-cluster spacing through the t-distributed stochastic neighbor embedding (t-SNE) algorithm to determine the positions of the mapping points, thereby realizing the visual clustering of the high-dimensional data.
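The mapping steps above can be sketched as follows. Because the coordinate formula is not reproduced in the text, the sketch assumes that the N dimension anchors are spaced evenly on the unit circle in the order given by λ, and that the geometric center of a polygon is the average of its vertices. Both choices, and all function names, are assumptions; the final t-SNE refinement is not shown.

```python
import math


def anchor_points(order):
    """Anchors on the unit circle; order[i] is the 1-based slot lambda(i) of dimension i."""
    n = len(order)
    return [
        (math.cos(2 * math.pi * (slot - 1) / n), math.sin(2 * math.pi * (slot - 1) / n))
        for slot in order
    ]


def map_record(record, anchors):
    """Each attribute value becomes a point on the ray to its anchor, at that distance."""
    return [(value * ax, value * ay) for value, (ax, ay) in zip(record, anchors)]


def geometric_center(points):
    """Vertex average of the polygon formed by the mapping points (an assumption)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# Hypothetical record with N = 4 dimensions, already sorted so that lambda = [1, 2, 3, 4].
anchors = anchor_points([1, 2, 3, 4])
polygon = map_record([1.0, 0.5, 1.0, 0.5], anchors)
center = geometric_center(polygon)
print(polygon, center)
```

Each group of data thus yields one polygon and one center point, and it is these center points whose spacing the final embedding step adjusts.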

A high dimensional data visualization cluster analysis system, the system comprising:

the preprocessing module is used for carrying out normalization preprocessing on the high-dimensional data;

the dimensionality extension module is used for carrying out dimensionality extension on the high-dimensional data after the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data after the dimensionality extension;

and the mapping module is used for mapping each group of the dimensionality-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method so as to realize the visualization clustering of the high-dimensional data.

Optionally, the dimension extension module specifically includes:

the initialization unit is used for initializing the population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data;

the index construction unit is used for constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data;

the screening unit is used for screening out the optimal individual through the multi-target evaluation index, and the optimal individual represents the optimal expansion state;

and the dimension expansion unit is used for performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion.

Optionally, the dimension extension unit specifically includes:

the statistical subunit is used for dividing the value range [0, 1] of each dimension of the normalized high-dimensional data into r equal intervals, counting the probability with which the data fall in each interval, and determining a probability histogram of each dimension;

the dividing unit is used for dividing each probability histogram by using an affinity propagation clustering algorithm and determining the division result of each dimension;

and the expansion subunit is used for performing dimension expansion according to the division result and the optimal expansion state to obtain the dimension-expanded high-dimensional data, wherein the number of dimensions into which each dimension is expanded is equal to the number of clusters of the probability histogram of that dimension, and each expanded datum has one and only one new dimension whose value is equal to the data value on the corresponding original dimension.

Optionally, the mapping module specifically includes:

a circle-like space construction unit for constructing a circle-like space $C_O$, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system, centered at the origin;

a similarity matrix determination unit for, according to

determining the correlation among the dimensions of each group of dimension-expanded high-dimensional data to obtain a similarity matrix, wherein $S_{ij}$ is the element in the ith row and the jth column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ordering value of the kth group of data in the ith dimension, the ordering values being the integers from 1 to M obtained by ranking the groups of dimension-expanded high-dimensional data by the size of their attribute values in each dimension;

the Fiedler vector determining unit is used for determining a Fiedler vector by solving for the eigenvector corresponding to the maximum eigenvalue of the Laplacian matrix of the similarity matrix;

the sorting unit is used for sorting the dimensionality of the high-dimensional data after each group of dimensionality expansion according to the size of the elements in the Fiedler vector to obtain the sorted high-dimensional data;

a coordinate point determination unit for determining, according to the formula, a coordinate point $V_{\lambda(i)}$ on the arc of $C_O$ for each dimension of the sorted high-dimensional data;

a two-dimensional mapping point determining unit for recording, in the circle-like space and for any group of high-dimensional data, the two-dimensional mapping points, any individual corresponding to N two-dimensional mapping points;

the geometric center determining unit is used for forming one-to-one corresponding polygons through the two-dimensional space point sets corresponding to the groups of data and determining the geometric centers of the polygons;

and the visual clustering realization unit is used for reducing the intra-cluster spacing of the geometric centers of the polygons and increasing their inter-cluster spacing through the t-distributed stochastic neighbor embedding (t-SNE) algorithm to determine the positions of the mapping points, thereby realizing the visual clustering of the high-dimensional data.
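The similarity formula used by the similarity matrix determination unit is not reproduced in the text. Since it is described in terms of ordering values and the scale K, the following sketch assumes a Spearman rank correlation between dimensions and a Laplacian of the form L = D − S. Both are assumptions, as are the function names and the tie-breaking rule.

```python
def ranks(column):
    """1-based ranks of the values in a column (ties broken by position, an assumption)."""
    order = sorted(range(len(column)), key=lambda k: column[k])
    result = [0] * len(column)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result


def rank_similarity(data):
    """Assumed Spearman-style similarity between every pair of dimensions."""
    columns = [ranks(list(col)) for col in zip(*data)]
    k = len(data)
    denom = k * (k * k - 1)
    return [
        [1 - 6 * sum((a - b) ** 2 for a, b in zip(ci, cj)) / denom for cj in columns]
        for ci in columns
    ]


def laplacian(similarity):
    """Graph Laplacian L = D - S, with D the diagonal matrix of row sums."""
    n = len(similarity)
    return [
        [(sum(similarity[i]) if i == j else 0.0) - similarity[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: 3 groups of data in 3 dimensions.
data = [[0.1, 0.2, 0.9], [0.4, 0.5, 0.6], [0.7, 0.8, 0.3]]
S = rank_similarity(data)
L = laplacian(S)
print(S)
```

Dimensions that rank the data identically get similarity 1, and dimensions that rank them in reverse get -1.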

Compared with the prior art, the invention has the following technical effects: the method normalizes the high-dimensional data; performs dimension expansion on the normalized data through a multi-target genetic algorithm; and maps each group of the dimension-expanded data to a circle-like space by a circle-like mapping visualization method. The high-dimensional data visual clustering analysis method and system provided by the invention thereby ensure the soundness and effectiveness of the visual clustering analysis and realize the visual clustering of high-dimensional data more efficiently, particularly for high-dimensional data containing nonlinear structures.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a high dimensional data visualization clustering analysis method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a high dimensional data visualization cluster analysis system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the probability histogram and the division result of each dimension of the iris data set when r = 20, according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

As shown in fig. 1, the high-dimensional data visualization cluster analysis method includes the following steps:

Step 101: carrying out normalization preprocessing on the high-dimensional data.

The high-dimensional data are normalized according to the formula

$$\hat{F}_{km} = \frac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$$

wherein $F_{km}$ and $\hat{F}_{km}$ respectively represent the original attribute value and the normalized attribute value of the kth group of high-dimensional data on the mth dimension.

Step 102: performing dimension expansion on the normalized high-dimensional data through a multi-target genetic algorithm to obtain the dimension-expanded high-dimensional data. This specifically comprises the following steps:

1) initializing a population of the multi-target genetic algorithm; the population comprises a plurality of individuals; the individual represents an expanded state of the high-dimensional data.

2) Constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data. Specifically, the method comprises the following steps:

The extension dimension of the high-dimensional data is determined by counting the number of 1s in the binary code of each individual in the population.

The topology maintenance index of each individual is determined according to the topology maintenance formula, wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and $t_k$ represents the rank ordering of the kth group of data.

The Dunn index of each individual is determined according to the Dunn index formula, in which one term represents the distance between cluster $C_i$ and cluster $C_j$ and another represents the diameter of cluster $C_k$.

3) Screening out the optimal individual through the multi-target evaluation index, wherein the optimal individual represents the optimal expansion state.

4) Performing dimension expansion on the normalized high-dimensional data according to the optimal expansion state to obtain the dimension-expanded high-dimensional data. The value range [0, 1] of each dimension of the normalized data is divided into r equal intervals, the probability with which the data fall in each interval is counted, and a probability histogram of each dimension is determined. Each probability histogram is divided by using an affinity propagation clustering algorithm to determine the division result of each dimension. Dimension expansion is then performed according to the division result and the optimal expansion state, wherein the number of dimensions into which each dimension is expanded is equal to the number of clusters of the probability histogram of that dimension; each expanded datum has one and only one new dimension whose value is equal to the data value on the corresponding original dimension, namely the new dimension corresponding to the division in which the original data value falls, and its values on the remaining new dimensions are 0.
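The screening in step 3) can be illustrated with a small non-dominated filter over the three evaluation indexes. This is a sketch only: it assumes that a smaller expansion dimension and larger topology maintenance and Dunn indexes are preferred, which the text does not state, and the index values below are made-up toy numbers rather than results.

```python
def expansion_dimension(code):
    """Number of 1s in an individual's binary code."""
    return sum(code)


def dominates(a, b):
    """True if a is at least as good as b on every index and better on one.

    Each candidate is (expansion_dimension, tp, dunn). The preference directions
    (minimize the first, maximize the other two) are assumptions.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] >= b[2]
    better = a[0] < b[0] or a[1] > b[1] or a[2] > b[2]
    return no_worse and better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]


# Hypothetical individuals (binary codes) with made-up (TP, Dunn) scores.
population = {
    (1, 0, 0, 0): (0.80, 1.2),
    (1, 1, 0, 0): (0.90, 1.5),
    (1, 1, 1, 0): (0.85, 1.4),
}
scored = [(expansion_dimension(code), tp, di) for code, (tp, di) in population.items()]
front = pareto_front(scored)
print(front)
```

A full multi-target genetic algorithm would add non-dominated sorting across generations, crossover and mutation; only the dominance test is shown here.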

Step 103: mapping each group of the dimension-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method to realize the visual clustering of the high-dimensional data. This specifically comprises the following steps:

constructing a circle-like space $C_O$, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system, centered at the origin;

determining the similarity matrix of the dimensions according to the similarity formula;

determining, according to the coordinate formula, a coordinate point $V_{\lambda(i)}$ on the arc of $C_O$ for each dimension of the sorted high-dimensional data;

in the circle-like space, recording the two-dimensional mapping points of any group of high-dimensional data, any individual corresponding to N two-dimensional mapping points;

According to the specific embodiment provided by the invention, the invention discloses the following technical effects: the method normalizes the high-dimensional data; performs dimension expansion on the normalized data through a multi-target genetic algorithm; and maps each group of the dimension-expanded data to a circle-like space by a circle-like mapping visualization method. The high-dimensional data visual clustering analysis method and system provided by the invention thereby ensure the soundness and effectiveness of the visual clustering analysis and realize the visual clustering of high-dimensional data more efficiently, particularly for high-dimensional data containing nonlinear structures.

The visual cluster analysis method proposed in this patent is described below by taking the iris data set, which has 150 groups of data in 4 dimensions, as an example.

Step A: carrying out normalization preprocessing on the iris data set, specifically comprising:

according to the formula

$$\hat{F}_{km} = \frac{F_{km} - \min(F_m)}{\max(F_m) - \min(F_m)}$$

normalizing the iris data set F, wherein $F_{km}$ and $\hat{F}_{km}$ respectively represent the original attribute value and the normalized attribute value of the kth group of iris data in the mth dimension; $\max(F_m)$ and $\min(F_m)$ respectively represent the maximum and minimum attribute values of the iris data set in the mth dimension; k = 1, 2, ..., 150; and m = 1, 2, 3, 4;

Step B: performing dimension expansion on the normalized iris data set through the NSGA-II multi-target genetic algorithm to obtain the dimension-expanded iris data set, specifically comprising:

initializing a population of the NSGA-II multi-target genetic algorithm; the population comprises a plurality of individuals; each individual is a binary code representing an expansion state of the high-dimensional data, the length of the binary code being the dimension of the iris data set, namely 4, wherein a 1 and a 0 in the binary code respectively indicate that the corresponding dimension of the iris data set is and is not subjected to dimension expansion;

constructing a multi-target evaluation index, wherein the multi-target evaluation index comprises the expansion dimension, the topology maintenance index and the Dunn index of the iris data set;

screening out an optimal individual through the multi-target evaluation index, wherein the optimal individual represents an optimal expansion state of the iris data set;

performing dimension expansion on the iris data set subjected to the normalization processing according to the optimal expansion state to obtain the iris data set subjected to the dimension expansion, and specifically comprising the following steps:

dividing the value range [0, 1] of each dimension of the normalized iris data set into 20 equal intervals, counting the probability with which the data fall in each interval, and determining the probability histograms of the 4 dimensions;

dividing each of the 4 probability histograms by using an affinity propagation clustering algorithm to determine the division results of the 4 dimensions, wherein the division of a probability distribution can be regarded as the clustering of two-dimensional data whose two coordinates are the x-axis value (namely the attribute value) and the y-axis value (namely the probability value) of the probability histogram of each dimension. FIG. 3 shows the probability histograms of the 4 dimensions of the iris data set and the division results, in which the two-dimensional data coordinates are represented by scatter points, and scatter points belonging to the same division are connected by a line of the same type.

Dimension expansion is performed according to the division result and the optimal expansion state to obtain the dimension-expanded high-dimensional iris data set, wherein the number of dimensions into which each of the 4 dimensions is expanded is equal to the number of clusters of the corresponding probability histogram; each expanded datum has one and only one new dimension whose value is equal to the data value on the corresponding original dimension, namely the new dimension corresponding to the division in which the original data value falls, and its values on the other new dimensions are 0. For example, FIG. 3 illustrates that the first dimension of the iris data set is divided into 3 parts, containing 6, 7 and 7 histogram points, respectively. That is, the first dimension of the iris data set is expanded into three new dimensions, with the divisions located at the values 0.3 and 0.65. Accordingly, if 3 groups of data have the values 0.2, 0.5 and 0.8 in the first dimension of the iris data set, their values in the new dimensions are [0.2, 0, 0], [0, 0.5, 0] and [0, 0, 0.8], respectively.
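The worked example above can be reproduced with a short sketch, assuming the first dimension is split at the values 0.3 and 0.65. The function name and the rule that a value equal to a boundary goes to the upper division are assumptions.

```python
def expand_dimension(value, boundaries):
    """Split one normalized attribute into len(boundaries) + 1 new dimensions.

    The value is copied into the slot of the division it falls in; every other
    slot is 0.
    """
    slot = sum(1 for b in boundaries if value >= b)
    expanded = [0.0] * (len(boundaries) + 1)
    expanded[slot] = value
    return expanded


# Division points taken from the example for the first iris dimension.
rows = [expand_dimension(v, [0.3, 0.65]) for v in (0.2, 0.5, 0.8)]
print(rows)
```

Each original value therefore survives unchanged in exactly one of the new dimensions, so no attribute information is lost by the expansion.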

Step C: mapping each group of the dimension-expanded high-dimensional iris data set to a circle-like space by using a circle-like mapping visualization method, specifically comprising:

constructing a circle-like space, wherein the circle-like space is a unit circle space of a two-dimensional rectangular coordinate system with an original point as a circle center;

according to

determining the correlation between the dimensions of each group of the dimension-expanded iris data set to obtain a similarity matrix, wherein $S_{ij}$ is the element in the ith row and the jth column of the similarity matrix, K represents the scale of the high-dimensional data F, and $t_{ki}$ is the ordering value of the kth group of data in the ith dimension, the ordering values being the integers from 1 to N obtained by ranking the groups of dimension-expanded iris data by their attribute value in each dimension, where N is the dimension of the dimension-expanded iris data set;

sorting the dimensions of the high-dimensional iris data sets after the dimension expansion according to the sizes of elements in the Fiedler vector to obtain sorted high-dimensional data;

according to the formula

determining, for each dimension of the sorted high-dimensional iris data set, a coordinate point $V_{\lambda(i)}$ on the arc of $C_O$;

in the circle-like space, for any group of the dimension-expanded iris data, recording the two-dimensional mapping points, which are determined by the attribute value of the kth group of data in the λ(i)th dimension, so that any individual corresponds to N two-dimensional mapping points;

forming one-to-one corresponding polygons through the two-dimensional space point sets corresponding to the iris data sets, and determining the geometric center of the polygons;

and reducing the intra-cluster spacing of the geometric centers of the polygons and increasing their inter-cluster spacing through the t-SNE algorithm to determine the positions of the mapping points, thereby realizing the visual clustering of the iris data set.
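The dimension ordering used in Step C can be sketched with power iteration, which finds the eigenvector for the largest eigenvalue of a Laplacian matrix as the text specifies. Note that the Fiedler vector is conventionally defined through the second-smallest eigenvalue, so this follows the wording of the text rather than that convention. The matrix below is a made-up example, the function names are hypothetical, and the sign of an eigenvector is arbitrary, so the resulting order may be reversed in other implementations.

```python
import math


def dominant_eigenvector(matrix, iterations=200):
    """Power iteration: unit eigenvector for the largest-magnitude eigenvalue."""
    n = len(matrix)
    # Arbitrary non-zero start; it must not be orthogonal to the sought vector.
    vector = [float((i + 1) ** 2) for i in range(n)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    return vector


def rank_order(vector):
    """lambda(i): 1-based rank of element i when elements are sorted ascending."""
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    result = [0] * len(vector)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


# Hypothetical Laplacian of a 3-dimension similarity graph (weights 1 and 2).
laplacian = [[1.0, -1.0, 0.0], [-1.0, 3.0, -2.0], [0.0, -2.0, 2.0]]
vec = dominant_eigenvector(laplacian)
lam = rank_order(vec)
print(lam)
```

The vector `lam` then decides the slot that each dimension occupies on the arc of the circle-like space.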

As shown in fig. 2, the present invention further provides a high dimensional data visualization cluster analysis system, which includes:

The preprocessing module 201 is configured to perform normalization preprocessing on the high-dimensional data according to the same normalization formula as in step 101.

The dimension expansion module 202 is configured to perform dimension expansion on the normalized high-dimensional data through a multi-target genetic algorithm to obtain the dimension-expanded high-dimensional data.

The dimension extension module 202 specifically includes:

the index construction unit is used for constructing a multi-target evaluation index; the multi-target evaluation index comprises the extended dimension, the topology maintenance index and the Dunn index of the high-dimensional data; specifically, the number of 1s in the binary code of each individual in the population is counted to determine the extension dimension of the high-dimensional data; the topology maintenance index and the Dunn index are determined according to their respective formulas, the Dunn index formula containing one term that represents the distance between cluster $C_i$ and cluster $C_j$ and another that represents the diameter of cluster $C_k$.
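The Dunn index used by the index construction unit can be sketched as below, assuming the standard definition: the smallest distance between points of two different clusters divided by the largest cluster diameter. The function name and the sample clusters are illustrative assumptions.

```python
import math


def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of 2-D mapping points."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Largest intra-cluster diameter over all clusters.
    max_diameter = max(
        max((dist(p, q) for p in c for q in c), default=0.0) for c in clusters
    )
    # Smallest distance between points of two different clusters.
    min_separation = min(
        dist(p, q)
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
        for p in ci
        for q in cj
    )
    return min_separation / max_diameter


# Hypothetical mapping points forming two well-separated clusters.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
di = dunn_index(clusters)
print(di)
```

Larger values indicate compact clusters that lie far apart. The sketch does not guard against the case where every cluster is a single point and the diameter is 0.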

The dimension extension unit specifically includes:

The mapping module 203 is configured to map each group of the dimension-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method, so as to implement visual clustering of the high-dimensional data.

The mapping module 203 specifically includes:

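a two-dimensional mapping point determining unit, used for recording the two-dimensional mapping points of any group of high-dimensional data in the circle-like space, any individual corresponding to N two-dimensional mapping points;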

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A high-dimensional data visualization cluster analysis method is characterized by comprising the following steps:

carrying out normalization preprocessing on the high-dimensional data;

performing dimension expansion on the high-dimensional data subjected to the normalization processing through a multi-target genetic algorithm to obtain the high-dimensional data subjected to the dimension expansion; the method specifically comprises the following steps:

constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data; the method specifically comprises the following steps:

according to the formula

determining the topology maintenance index of each individual, wherein u and s both represent the number of nearest-neighbor data points, NN_ky and nn_ky respectively represent the y-th nearest data point of the k-th group of data points in the original space and in the mapping space, and nn_kl and nn_kt respectively represent the l-th and t-th nearest data points of the k-th group of data points in the mapping space; according to the formula

represents the distance between cluster C_i and cluster C_j;

represents the diameter of cluster C_k;

performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion;

2. The high-dimensional data visualization cluster analysis method according to claim 1, wherein the normalization preprocessing is performed on the high-dimensional data, and specifically comprises: according to the formula

the high-dimensional data are subjected to normalization preprocessing, wherein F_km and

3. The high-dimensional data visualization cluster analysis method according to claim 1, wherein performing dimension expansion on the high-dimensional data after the normalization processing according to the optimal expansion state to obtain the high-dimensional data after the dimension expansion specifically comprises:

4. The method for high-dimensional data visual cluster analysis according to claim 1, wherein the step of mapping each set of the dimensionality-expanded high-dimensional data to a circle-like space by using a circle-like mapping visualization method to realize visual clustering of the high-dimensional data specifically comprises the steps of:

constructing a circle-like space C₀, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system, centered at the origin;

according to

determining the correlation between the dimensions of each group of the dimension-expanded high-dimensional data to obtain a similarity matrix, wherein S_ij is the element in the i-th row and the j-th column of the similarity matrix, K represents the scale of the high-dimensional data F, and t_ki is the rank value of the k-th group of data in the i-th dimension, the rank values being the values obtained by ranking each group of the dimension-expanded high-dimensional data by attribute value in each dimension using the integers 1 to M;

according to the formula

determining, for each dimension of the sorted high-dimensional data, the coordinate point V_λ(i) on the arc of C₀, wherein

in the circle-like space, for any high-dimensional data

Is recorded as a two-dimensional mapping point, wherein,

Corresponding to the N two-dimensional mapping points;

5. A high dimensional data visualization cluster analysis system, the system comprising:

the mapping module is used for mapping each group of the high-dimensional data after the dimensionality expansion to a circle-like space by using a circle-like mapping visualization method to realize the visualization clustering of the high-dimensional data;

the dimension extension module specifically includes:

the index construction unit is used for constructing a multi-target evaluation index; the multi-target evaluation index comprises an expansion dimension, a topology maintenance index and a Dunn index of the high-dimensional data; the method specifically comprises the following steps:

according to the formula

determining the topology maintenance index of each individual, wherein TP represents the topology maintenance index, K represents the scale of the high-dimensional data F, and t_k represents the rank ordering of the k-th group of data; according to the formula

represents the distance between cluster C_i and cluster C_j;

represents the diameter of cluster C_k;

6. The high-dimensional data visualization cluster analysis system according to claim 5, wherein the dimension extension unit specifically comprises:

7. The high-dimensional data visualization cluster analysis system according to claim 5, wherein the mapping module specifically comprises:

a circle-like space construction unit, configured to construct a circle-like space C₀, wherein the circle-like space is the unit circle of a two-dimensional rectangular coordinate system, centered at the origin;

Is recorded as a two-dimensional mapping point, wherein,

Corresponding to the N two-dimensional mapping points;