CN107368599B

CN107368599B - Visual analysis method and system for high-dimensional data

Info

Publication number: CN107368599B
Application number: CN201710620143.6A
Authority: CN
Inventors: 夏佳志; 李强; 廖胜辉; 奎晓燕; 王建新
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2017-07-26
Filing date: 2017-07-26
Publication date: 2020-06-23
Anticipated expiration: 2037-07-26
Also published as: CN107368599A

Abstract

The invention relates to a visual analysis method for high-dimensional data, comprising: establishing a local subspace difference-geodesic distance projection on the original high-dimensional data; establishing a mapping of clustering point clusters; and establishing visual analysis views of a series of subspaces. The invention also discloses an analysis system for implementing the visual analysis method. By establishing the local subspace difference-geodesic distance projection, the clustering point cluster mapping, and the visual analysis views of a series of subspaces, the invention provides a series of interactive visual analysis operations and thus a reliable technical basis for visual subspace clustering and analysis. Users are guided and helped to analyze and explore high-dimensional data effectively; the amount of user trial and error in high-dimensional data processing is markedly reduced, data redundancy is reduced, the interactivity of the data analysis process is enhanced, and the reliability of the results is improved.

Description

Visual analysis method and system for high-dimensional data

Technical Field

The invention particularly relates to a visual analysis method and a visual analysis system for high-dimensional data.

Background

With economic and technological development and the arrival of the digital society, data has become an indispensable part of production and daily life. People deal with endless data every day, such as financial data, scientific computing data, and biomedical data. Data analysis is therefore one of the most active areas of development today. Data mining and visual analysis technology is an important part of the information technology field; in data mining and analysis, a powerful visual analysis method can achieve twice the result with half the effort.

High dimensionality is one of the important features of big data. High-dimensional data often contains subspace cluster structures. Automated subspace clustering methods often generate subspace clustering results that are highly redundant and difficult to understand. Visual analysis is an effective way to enhance user cognition and aid user understanding. However, most current visual analysis methods for subspace clustering visualize the results of automated methods rather than address the subspace clustering task itself, so they can hardly improve the results of the subspace clustering method.

Disclosure of Invention

One objective of the present invention is to provide a visual analysis method for high-dimensional data that effectively guides and helps users to analyze and explore high-dimensional data.

It is another object of the present invention to provide an analysis system for implementing the method for visually analyzing high-dimensional data.

The invention provides a visual analysis method of high-dimensional data, which comprises the following steps:

S1. establishing a local subspace difference-geodesic distance projection on the original high-dimensional data;

S2. establishing a mapping of clustering point clusters in the local subspace difference-geodesic distance projection;

S3. establishing visual analysis views of a series of subspaces according to the local subspace difference-geodesic distance projection obtained in step S1 and the mapping of clustering point clusters obtained in step S2.

In step S1, the local subspace difference-geodesic distance projection on the original high-dimensional data is established by the following steps:

A. establishing a data point correlation metric based on geodesic distance for the high-dimensional data to be projected;

B. establishing a local subspace difference metric according to the metric established in step A;

C. establishing the local subspace difference-geodesic distance projection according to the metrics established in steps A and B.

In step A, the data point correlation metric based on geodesic distance is established by the following steps:

I. constructing an SNN graph with a plurality of connected components from the high-dimensional data set to be projected;

II. for the connected components obtained in step I, connecting any two independent connected components;

III. calculating the shortest distance between any two points to obtain the geodesic distance.

In step II, any two independent connected components are connected by linking the two data points, one from each component, that are closest to each other.

In step III, the shortest distance is calculated with a shortest path algorithm.
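Steps I to III can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Euclidean edge weights, reads the SNN graph as the mutual k-nearest-neighbor subgraph of the K-NN graph, and uses Dijkstra's algorithm as the shortest path algorithm. The function names (`mutual_knn_graph`, `bridge_components`, `geodesic_distances`) are hypothetical.

```python
import heapq
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mutual_knn_graph(points, k):
    """Step I (assumed reading): edge p-q only if each is a k-nearest neighbor of the other."""
    n = len(points)
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: euclid(points[i], points[j]))
        knn.append(set(order[:k]))
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in knn[i]:
            if i in knn[j]:
                graph[i][j] = euclid(points[i], points[j])
    return graph


def components(graph):
    """Connected components found by depth-first search."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def bridge_components(points, graph):
    """Step II: join every pair of components through their two closest points."""
    comps = components(graph)
    for a in range(len(comps)):
        for b in range(a + 1, len(comps)):
            d, u, v = min((euclid(points[u], points[v]), u, v)
                          for u in comps[a] for v in comps[b])
            graph[u][v] = graph[v][u] = d


def geodesic_distances(points, k):
    """Step III: all-pairs shortest paths (Dijkstra from every source)."""
    graph = mutual_knn_graph(points, k)
    bridge_components(points, graph)
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0.0
        heap = [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[s][u]:
                continue
            for v, w in graph[u].items():
                if d + w < dist[s][v]:
                    dist[s][v] = d + w
                    heapq.heappush(heap, (d + w, v))
    return dist
```

Because every pair of components is bridged, the graph is connected and all geodesic distances are finite.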

In step B, the local subspace difference metric is established by the following steps:

1) the weight of each dimension is calculated using the following formula:

$\omega = [\omega_1, \omega_2, \ldots, \omega_i, \ldots, \omega_d]$,

where $\omega$ is the dimension weight matrix, $\omega_i$ is the weight of the i-th dimension, $\sigma_i$ is the variance of the data points in the i-th dimension, and d is the number of dimensions;

2) the weighted distance between any two points in the SNN graph is calculated using the following formula:

$d_{pq}[W] = \max\{d_{pq}[\omega_p],\ d_{pq}[\omega_q]\}$

wherein

$d_{pq}[W]$ is the weighted distance matrix entry for points p and q; $\omega_p = [\omega_{p1}, \omega_{p2}, \ldots, \omega_{pi}, \ldots, \omega_{pd}]$ is the feature vector of the local subspace of point p; $\omega_q = [\omega_{q1}, \omega_{q2}, \ldots, \omega_{qi}, \ldots, \omega_{qd}]$ is the feature vector of the local subspace of point q; $d_i$ is the Euclidean distance between point p and point q in the i-th dimension; $d_{pq}[\omega_p]$ is the weighted distance of point q relative to point p; and $d_{pq}[\omega_q]$ is the weighted distance of point p relative to point q;

3) based on cosine similarity, a delta matrix is established according to the following formula:

in the formula, each entry of the delta matrix is the difference value between points $p_i$ and $p_j$ based on cosine similarity; i and j are the indices of the data points, with range [0, n), and n is the size of the data set.
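Steps 2) and 3) can be illustrated with a short sketch. The text states only the rule $d_{pq}[W] = \max\{d_{pq}[\omega_p], d_{pq}[\omega_q]\}$ and that the delta matrix is based on cosine similarity; the exact expressions for $d_{pq}[\omega_p]$ and for the delta entries are not reproduced in the text. The sketch therefore assumes a weighted Euclidean form for $d_{pq}[\omega_p]$ and a difference of one minus the cosine similarity. Both forms, and all function names, are assumptions made for illustration.

```python
import math


def weighted_distance(p, q, w):
    """ASSUMPTION: weighted Euclidean form sqrt(sum_i w_i * d_i^2),
    where d_i is the per-dimension distance between p and q."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q)))


def symmetric_weighted_distance(p, q, w_p, w_q):
    """d_pq[W] = max{d_pq[w_p], d_pq[w_q]}, as stated in step 2)."""
    return max(weighted_distance(p, q, w_p), weighted_distance(p, q, w_q))


def cosine_difference(w_p, w_q):
    """ASSUMPTION: difference = 1 - cosine similarity of the two
    local subspace feature vectors."""
    dot = sum(a * b for a, b in zip(w_p, w_q))
    norm = math.sqrt(sum(a * a for a in w_p)) * math.sqrt(sum(b * b for b in w_q))
    return 1.0 - dot / norm


def delta_matrix(weights):
    """n x n matrix of pairwise differences, with indices i, j in [0, n)."""
    n = len(weights)
    return [[cosine_difference(weights[i], weights[j]) for j in range(n)]
            for i in range(n)]
```

Taking the maximum of the two directed distances makes the measure symmetric in p and q, so two points are close only when each is close under the other's local subspace weights.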

In step C, the local subspace difference-geodesic distance projection is established by using the MDS algorithm to map the data points to the x-axis through the local subspace difference metric and to the y-axis through the geodesic-distance-based data point correlation metric.
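One possible reading of step C is that each axis is obtained by a separate one-dimensional MDS. The sketch below uses classical (Torgerson) MDS implemented with NumPy; the text does not say which MDS variant is used, so this choice and the function names are assumptions.

```python
import numpy as np


def classical_mds_1d(dist):
    """Classical (Torgerson) MDS reduced to a single output coordinate."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering      # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)       # eigenvalues in ascending order
    top = max(eigvals[-1], 0.0)                   # guard against tiny negative values
    return eigvecs[:, -1] * np.sqrt(top)


def difference_geodesic_projection(delta, geodesic):
    """x from the local subspace difference matrix, y from geodesic distances."""
    x = classical_mds_1d(delta)
    y = classical_mds_1d(geodesic)
    return np.column_stack([x, y])
```

The sign of each coordinate is arbitrary, as in any MDS embedding; only the relative positions along each axis carry meaning.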

In step S2, the mapping of clustering point clusters is established by the following steps:

(I) extracting the data points selected in the local subspace difference-geodesic distance projection established in step S1, the selection being made by user interaction, to obtain the data point set to be processed;

(II) calculating the eigenvectors of the data point set obtained in step (I) using PCA (Principal Component Analysis);

(III) selecting the two eigenvectors with the minimum eigenvalues from step (II) to generate a mapping transformation onto a two-dimensional plane, thereby completing the mapping of the clustering point clusters.
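Steps (II) and (III) can be sketched with NumPy as below. The sketch follows the wording of step (III) and projects onto the two eigenvectors with the smallest eigenvalues by default; the `use_smallest` flag is a hypothetical addition that switches to the two largest eigenvalues, which is the more common PCA convention. The function name is likewise an assumption.

```python
import numpy as np


def cluster_point_mapping(selected, use_smallest=True):
    """Project the user-selected points onto two covariance eigenvectors."""
    data = np.asarray(selected, dtype=float)
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    if use_smallest:
        axes = eigvecs[:, :2]                     # two smallest, as worded in step (III)
    else:
        axes = eigvecs[:, -2:][:, ::-1]           # two largest, largest first
    return centered @ axes
```

At least two dimensions and two selected points are needed for the covariance matrix to be defined.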

In step S3, the visual analysis views of the series of subspaces are established by the following steps:

(a) performing non-parametric representation of the subspace;

(b) performing similarity measurement on the subspaces;

(c) establishing a comparison and interaction interface for the subspaces according to the non-parametric representation and the similarity measurement of the subspaces, thereby completing the visual analysis of the high-dimensional data.

In step (a), the non-parametric representation of the subspace follows this rule:

defining the subspace $S_C$ accommodating the cluster C as $\{\omega[p] \mid p \in C\}$, where $\omega[p]$ is the local subspace feature vector of point p.

In step (b), the similarity of the subspaces is measured by the following steps:

(i) for a d-dimensional subspace accommodating a cluster C with n points, the feature vector is defined by the following equation

wherein

in the formula, $[\omega_{i1}, \omega_{i2}, \ldots, \omega_{ij}, \ldots, \omega_{id}]$ is the subspace feature vector of point $p_i$ in cluster C;

(ii) measuring the similarity of two subspaces by the cosine similarity of the feature vectors obtained in step (i).
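A small sketch of steps (i) and (ii) follows. The defining equation of the subspace feature vector is not reproduced in the text, so the sketch assumes it is the per-dimension mean of the local subspace feature vectors of the n points in the cluster. That aggregation and the function names are assumptions, while the cosine similarity of step (ii) is taken from the text.

```python
import math


def subspace_feature_vector(cluster_weights):
    """ASSUMPTION: per-dimension mean of the members' local subspace
    feature vectors [w_i1, ..., w_id], i = 1..n."""
    n = len(cluster_weights)
    d = len(cluster_weights[0])
    return [sum(w[j] for w in cluster_weights) / n for j in range(d)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def subspace_similarity(cluster_a, cluster_b):
    """Step (ii): cosine similarity of the two subspace feature vectors."""
    return cosine_similarity(subspace_feature_vector(cluster_a),
                             subspace_feature_vector(cluster_b))
```

For example, a cluster whose points weight only the first dimension and a cluster whose points weight only the second dimension have similarity 0 under this sketch.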

In step (c), the comparison and interaction interface for the subspaces is established as follows:

assuming that cluster C₁ is contained in subspace V₁ and cluster C₂ is contained in subspace V₂, the operation interface is established by the following rules:

subspace summation interface: the result of subspace V₁+V₂ is C₁∩C₂, and the active dimensions of subspace V₁+V₂ are the union of the active dimensions of V₁ and V₂;

subspace intersection interface: the result of subspace V₁∩V₂ is C₁∪C₂, and the active dimensions of subspace V₁∩V₂ are the intersection of the active dimensions of V₁ and V₂;

subspace list ordering interface: given a set of selected dimensions, the feature vectors on the selected dimensions are used to judge the similarity among a group of subspaces, and the group of subspaces is mapped onto a one-dimensional coordinate axis by a one-dimensional MDS (multidimensional scaling) algorithm to form a sorted list of subspaces.
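The summation and intersection rules above can be expressed directly with sets. The sketch below is only an illustration: the `Subspace` record and the function names are hypothetical, with active dimensions and cluster members both stored as frozen sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subspace:
    dims: frozenset      # active dimensions of the subspace
    points: frozenset    # ids of the points in the contained cluster


def subspace_sum(v1, v2):
    """V1 + V2: union of active dimensions, intersection of clusters."""
    return Subspace(v1.dims | v2.dims, v1.points & v2.points)


def subspace_intersection(v1, v2):
    """V1 ∩ V2: intersection of active dimensions, union of clusters."""
    return Subspace(v1.dims & v2.dims, v1.points | v2.points)
```

The two rules are dual: adding dimensions restricts the result to the points clustered in both subspaces, while keeping only the shared dimensions admits the points of either cluster.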

The invention also discloses an analysis system for implementing the visual analysis method for high-dimensional data, comprising a local subspace difference-geodesic distance projection establishing module, a clustering point cluster mapping establishing module, and a subspace-series visual analysis view establishing module, connected in series in that order. The projection establishing module establishes the local subspace difference-geodesic distance projection on the original high-dimensional data and passes it to the clustering point cluster mapping establishing module; the clustering point cluster mapping establishing module establishes the mapping of clustering point clusters according to the projection and passes it to the visual analysis view establishing module; and the visual analysis view establishing module establishes the visual analysis views of the series of subspaces according to the established projection and the mapping of clustering point clusters.

By establishing the local subspace difference-geodesic distance projection, the clustering point cluster mapping, and the visual analysis views of a series of subspaces, the visual analysis method and system provided by the invention offer a series of interactive visual analysis operations and thus a reliable technical basis for visual subspace clustering and analysis. Users are guided and helped to analyze and explore high-dimensional data effectively; the amount of user trial and error in high-dimensional data processing is markedly reduced, data redundancy is reduced, the interactivity of the data analysis process is enhanced, and the reliability of the results is improved.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a functional block diagram of the system of the present invention.

Detailed Description

FIG. 1 shows a flow chart of the method of the present invention. The invention provides a visual analysis method for high-dimensional data, comprising the following steps:

S1. establishing a local subspace difference-geodesic distance projection on the original high-dimensional data, specifically by the following steps:

A. for the high-dimensional data to be projected, establishing a data point correlation metric based on geodesic distance, specifically by the following steps:

I. constructing an SNN graph with a plurality of connected components from the high-dimensional data set to be projected; the SNN graph is a subgraph of the K-NN graph. Specifically, in the SNN graph there is an edge between points p and q if and only if they are k-nearest neighbors of each other;

II. for the connected components obtained in step I, connecting any two independent connected components, specifically by linking the two data points, one from each component, that are closest to each other;

III. calculating the shortest distance between any two points with a shortest path algorithm to obtain the geodesic distance;

B. establishing a local subspace difference metric according to the metric established in step A, specifically by the following steps:

1) the weight of each dimension is calculated using the following formula:

$\omega = [\omega_1, \omega_2, \ldots, \omega_i, \ldots, \omega_d]$,

2) the weighted distance between any two points in the SNN graph is calculated using the following formula:

$d_{pq}[W] = \max\{d_{pq}[\omega_p],\ d_{pq}[\omega_q]\}$

wherein

$d_{pq}[W]$ is the weighted distance matrix entry for points p and q; $\omega_p = [\omega_{p1}, \omega_{p2}, \ldots, \omega_{pi}, \ldots, \omega_{pd}]$ is the feature vector of the local subspace of point p; $\omega_q = [\omega_{q1}, \omega_{q2}, \ldots, \omega_{qi}, \ldots, \omega_{qd}]$ is the feature vector of the local subspace of point q; $d_i$ is the Euclidean distance between point p and point q in the i-th dimension; $d_{pq}[\omega_p]$ is the weighted distance of point q relative to point p; and $d_{pq}[\omega_q]$ is the weighted distance of point p relative to point q;

3) based on cosine similarity, a delta matrix is established according to the following formula:

in the formula, each entry of the delta matrix is the difference value between points $p_i$ and $p_j$ based on cosine similarity; i and j are the indices of the data points, with range [0, n), and n is the size of the data set;

C. establishing the local subspace difference-geodesic distance projection according to the metrics established in steps A and B, specifically by using the MDS algorithm to map the data points to the x-axis through the local subspace difference metric and to the y-axis through the geodesic-distance-based data point correlation metric; the x-axis represents the local subspace difference and the y-axis represents the geodesic distance metric. In the local subspace difference-geodesic distance projection, the clusters are separated from one another;

S2. establishing a mapping of clustering point clusters in the local subspace difference-geodesic distance projection, specifically by the following steps:

(I) extracting the data points selected in the local subspace difference-geodesic distance projection established in step S1, the selection being made by the user, to obtain the data point set to be processed;

(II) calculating the eigenvectors of the data point set obtained in step (I) using PCA (Principal Component Analysis);

(III) selecting the two eigenvectors with the minimum eigenvalues from step (II) to generate a mapping transformation onto a two-dimensional plane, thereby completing the mapping of the clustering point clusters;

S3. establishing visual analysis views of a series of subspaces according to the local subspace difference-geodesic distance projection obtained in step S1 and the mapping of clustering point clusters obtained in step S2, specifically by the following steps:

(a) performing non-parametric representation of the subspace, specifically by the following rule:

defining the subspace $S_C$ accommodating the cluster C as $\{\omega[p] \mid p \in C\}$, where $\omega[p]$ is the local subspace feature vector of point p;

(b) performing similarity measurement on the subspaces, specifically by the following steps:

(i) for a d-dimensional subspace accommodating a cluster C with n points, the feature vector is defined by the following equation

wherein

(ii) measuring the similarity of two subspaces by the cosine similarity of the feature vectors obtained in step (i);

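(c) establishing a comparison and interaction interface for the subspaces according to the non-parametric representation and the similarity measurement of the subspaces, thereby completing the visual analysis of the high-dimensional data; specifically, assuming that cluster C₁ is contained in subspace V₁ and cluster C₂ is contained in subspace V₂, the operation interface is established by the following rules:

subspace summation interface: the result of subspace V₁+V₂ is C₁∩C₂, and the active dimensions of subspace V₁+V₂ are the union of the active dimensions of V₁ and V₂;

subspace intersection interface: the result of subspace V₁∩V₂ is C₁∪C₂, and the active dimensions of subspace V₁∩V₂ are the intersection of the active dimensions of V₁ and V₂;

subspace list ordering interface: given a set of selected dimensions, the feature vectors on the selected dimensions are used to judge the similarity among a group of subspaces, and the group of subspaces is mapped onto a one-dimensional coordinate axis by a one-dimensional MDS algorithm to form a sorted list of subspaces.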

FIG. 2 shows a functional block diagram of the system of the present invention. The invention also discloses an analysis system for implementing the visual analysis method for high-dimensional data, comprising a local subspace difference-geodesic distance projection establishing module, a clustering point cluster mapping establishing module, and a subspace-series visual analysis view establishing module, connected in series in that order. The projection establishing module establishes the local subspace difference-geodesic distance projection on the original high-dimensional data and passes it to the clustering point cluster mapping establishing module; the clustering point cluster mapping establishing module establishes the mapping of clustering point clusters according to the projection and passes it to the visual analysis view establishing module; and the visual analysis view establishing module establishes the visual analysis views of the series of subspaces according to the established projection and the mapping of clustering point clusters.

Claims

1. A visual analysis method for high-dimensional data, comprising the following steps:

S1. establishing a local subspace difference-geodesic distance projection on the original high-dimensional data; specifically, the projection is established by the following steps:

A. establishing a data point correlation metric based on geodesic distance for the high-dimensional data to be projected; specifically, the metric is established by the following steps:

calculating the shortest distance between any two points to obtain the geodesic distance;

B. establishing a local subspace difference metric according to the metric established in step A; specifically, the metric is established by the following steps:

1) the weight of each dimension is calculated using the following formula:

$d_{pq}[W] = \max\{d_{pq}[\omega_p],\ d_{pq}[\omega_q]\}$

wherein

$d_{pq}[W]$ is the weighted distance matrix entry for points p and q; $\omega_p = [\omega_{p1}, \omega_{p2}, \ldots, \omega_{pi}, \ldots, \omega_{pd}]$ is the feature vector of the local subspace of point p; $\omega_q = [\omega_{q1}, \omega_{q2}, \ldots, \omega_{qi}, \ldots, \omega_{qd}]$ is the feature vector of the local subspace of point q; $d_i$ is the Euclidean distance between point p and point q in the i-th dimension; $d_{pq}[\omega_p]$ is the weighted distance of point q relative to point p; and $d_{pq}[\omega_q]$ is the weighted distance of point p relative to point q;

in the formula

C. establishing the local subspace difference-geodesic distance projection according to the metrics established in steps A and B;

S2. establishing a mapping of clustering point clusters in the local subspace difference-geodesic distance projection; specifically, the mapping is established by the following steps:

(I) extracting the data points selected in the local subspace difference-geodesic distance projection established in step S1, the selection being made by user interaction, to obtain the data point set to be processed;

(II) calculating the eigenvectors of the data point set obtained in step (I) using PCA;

S3. establishing visual analysis views of a series of subspaces according to the local subspace difference-geodesic distance projection obtained in step S1 and the mapping of clustering point clusters obtained in step S2; specifically, the visual analysis views are established by the following steps:

(a) performing non-parametric representation on the subspace; specifically, the following rules are adopted for non-parametric representation:

(b) performing similarity measurement on the subspaces; specifically, the similarity is measured by the following steps:

wherein

(ii) measuring the similarity of two subspaces by the cosine similarity of the feature vectors obtained in step (i);

subspace list ordering interface: given a set of selected dimensions, the feature vectors on the selected dimensions are used to judge the similarity among a group of subspaces, and the group of subspaces is mapped onto a one-dimensional coordinate axis by a one-dimensional MDS algorithm to form a sorted list of subspaces.

2. An analysis system for implementing the visual analysis method for high-dimensional data as claimed in claim 1, characterized by comprising a local subspace difference-geodesic distance projection establishing module, a clustering point cluster mapping establishing module, and a subspace-series visual analysis view establishing module, connected in series in that order; the projection establishing module is used for establishing the local subspace difference-geodesic distance projection on the original high-dimensional data and passing it to the clustering point cluster mapping establishing module; the clustering point cluster mapping establishing module is used for establishing the mapping of clustering point clusters according to the local subspace difference-geodesic distance projection and passing it to the visual analysis view establishing module; and the visual analysis view establishing module is used for establishing the visual analysis views of the series of subspaces according to the established local subspace difference-geodesic distance projection and the mapping of clustering point clusters.