CN111488502A

CN111488502A - Low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout

Info

Publication number: CN111488502A
Application number: CN202010279193.4A
Authority: CN
Inventors: 牛奉高; 赵欣蕊
Original assignee: Shanxi University
Current assignee: Shanxi University
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2020-08-04

Abstract

The invention belongs to the technical field of visualization, and particularly relates to a method for constructing low-dimensional parallel coordinate graphs based on an Isomap algorithm layout. The correlation between all dimensions of the data is computed and the dimensions are laid out accordingly; dimension subsets are then divided according to the layout result; finally, the dimensions of each subset are arranged in order and a parallel coordinate graph is constructed. The invention is based on the isometric feature mapping (Isomap) method, which measures a long distance as a chain of many short distances, so that the dimensions are laid out more faithfully and low-dimensional parallel coordinate graphs can be constructed. The resulting visualization reduces errors caused by distance-projection distortion, uses the display space effectively while retaining as much of the useful information in the original data as possible, and presents a result from which information about related dimensions is easy to extract and read.

Description

Low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout

Technical Field

The invention belongs to the technical field of visualization, and particularly relates to a low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout.

Background

With the advent of the big data era, the amount of data people generate keeps growing, and data are updated ever faster. Besides the growth in volume, the nature of the data itself has also changed. Compared with the past, a large part of the vast amount of data produced by modern information streams is high-dimensional. Because of the limits of spatial imagination, it is difficult for people to obtain information directly from such data. Under these circumstances, how to process high-dimensional data effectively and how to extract valuable information from it have become problems to be solved.

Data visualization has become an important approach to solving this problem. Data visualization is a branch of visualization: it is the process of converting data into a visual form, and the scientific and technical study of the visual representation of data. It converts high-dimensional data, which is hard to imagine spatially, into graphs that people can observe directly. This conversion not only lets people grasp the surface information of the data quickly, but also makes full use of human insight, so that the logical relationships hidden beneath massive data can be inferred more easily. For this reason, the visualization of high-dimensional data has become one of the current research directions for many scientists.

Disclosure of Invention

In view of the above problems, the invention provides a method for constructing low-dimensional parallel coordinate graphs based on an Isomap algorithm layout.

To achieve this purpose, the invention adopts the following technical scheme:

The method for constructing a low-dimensional parallel coordinate graph based on an Isomap algorithm layout comprises the following steps:

step S1, performing dimension correlation calculation and layout by using the Isomap algorithm;

step S2, dividing the dimensions into subsets;

step S3, arranging the dimensions in order and constructing a parallel coordinate graph.

Further, the dimension correlation calculation and layout using the Isomap algorithm in step S1 comprises the following steps:

step S1.1, preprocessing the data;

step S1.2, calculating the distance dist(d_i, d_j) between each pair (d_i, d_j) of numerical dimensions and generating a distance matrix L;

step S1.3, inputting the obtained distance matrix into the MDS algorithm to obtain a two-dimensional scatter diagram layout of the dimension points.

Still further, the data preprocessing in step S1.1 consists of cleaning the data, filling in missing values, establishing a data set, and regarding the values of each dimension of the data set as a vector;

if the data set D has n samples, each with m attributes, the data set D and its i-th item a_i are expressed as follows:

D = {a₁, a₂, …, a_n}

a_i = {v_i1, v_i2, …, v_im}

wherein v_ij represents the value of the j-th dimension of the i-th item;

regarding each numerical dimension as a vector, we obtain:

D = {d₁, d₂, …, d_m}

d_j = {v_1j, v_2j, …, v_nj}

wherein d_j represents the j-th dimension and n is the number of samples.
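As an illustrative sketch only (not part of the original disclosure), the change of view from samples a_i to dimension vectors d_j amounts to transposing the data table. The toy values and the function name `to_dimension_vectors` are assumptions made for this example.

```python
# Hypothetical toy data set D: n = 4 samples (rows), m = 3 numerical dimensions.
samples = [
    [1.0, 10.0, 0.5],
    [2.0, 20.0, 0.4],
    [3.0, 30.0, 0.1],
    [4.0, 40.0, 0.3],
]


def to_dimension_vectors(rows):
    """Return [d_1, ..., d_m], where d_j = [v_1j, ..., v_nj]."""
    if not rows:
        return []
    m = len(rows[0])
    if any(len(r) != m for r in rows):
        raise ValueError("every sample must have the same number of dimensions")
    # zip(*rows) transposes the table: one tuple per column (dimension).
    return [list(col) for col in zip(*rows)]


dims = to_dimension_vectors(samples)
```

Each entry of `dims` is then one dimension point of length n, which is the object the later distance computation works on.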

Still further, the method in step S1.2 of calculating the distance dist(d_i, d_j) between each pair (d_i, d_j) of numerical dimensions and generating the distance matrix L comprises the following steps:

step S1.2.1, setting a neighborhood parameter k, i.e., the distance between each dimension point d_j and its k nearest dimension points is set to the Euclidean distance;

taking the dimension point d₁ as the starting point, let the set S = {d₁} and let the set U contain all vertices other than d₁, i.e., U = {the remaining vertices}; if a dimension point u in the set U is a neighboring dimension point of d₁, the distance of <d₁,u> is recorded as dist(d₁,u); if u is not a neighboring dimension point of d₁, the distance of <d₁,u> is recorded as ∞;

step S1.2.2, selecting from the k neighboring dimension points of d₁ the dimension point d₂ with the minimum distance, removing d₂ from the set U and adding it to the set S; then, taking d₂ as the newly considered intermediate point, updating the distances of the dimension points in the set U: if the distance from the starting point d₁ to a dimension point u through d₂ is shorter than the distance that does not pass through d₂, the distance value of u is updated to dist(d₁,u) = <d₁,d₂> + <d₂,u>;

step S1.2.3, repeating step S1.2.2 until all dimension points are contained in the set S, which gives the distances from d₁ to all dimension points; the same procedure is applied from every starting point, so that the distance dist(d_i, d_j) between each pair (d_i, d_j) of numerical dimensions is obtained and the distance matrix L is generated, denoted L = (dist(d_i, d_j))_{m×m}.

Compared with a distance matrix computed from pairwise Euclidean distances alone, the distance matrix L obtained in step S1.2 replaces long distances with estimates of the intrinsic geodesic distance, so that the layout computed by the algorithm reflects the strength of the correlation between dimensions more accurately and errors are reduced.
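The procedure of steps S1.2.1 to S1.2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the neighbor graph is symmetrized, Dijkstra's algorithm is run with a priority queue from every dimension point, and all function names and the toy input are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(dims, k):
    """Weighted neighbor graph: each dimension point is linked to its k nearest points."""
    m = len(dims)
    graph = {i: {} for i in range(m)}
    for i in range(m):
        others = sorted((euclidean(dims[i], dims[j]), j) for j in range(m) if j != i)
        for d, j in others[:k]:
            # Assumption: edges are stored in both directions so that L is symmetric.
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_paths(graph, source):
    """Dijkstra from one dimension point; unreachable points keep distance inf."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def geodesic_matrix(dims, k):
    """Distance matrix L whose entries are shortest-path (geodesic) distances."""
    graph = knn_graph(dims, k)
    m = len(dims)
    rows = []
    for i in range(m):
        d = shortest_paths(graph, i)
        rows.append([d[j] for j in range(m)])
    return rows


# Hypothetical toy input: three dimension points forming a bend.
dims = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
L = geodesic_matrix(dims, k=1)
```

In this toy input the straight-line distance between the first and third point is 2, while the path through the middle point is twice the square root of 2, which is the kind of "long distance made of short distances" the text describes.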

Still further, the method in step S1.3 of inputting the obtained distance matrix into the MDS algorithm to obtain the two-dimensional scatter diagram layout of the dimension points is as follows:

the distance matrix is L = (dist(d_i, d_j))_{m×m} ∈ R^{m×m}, wherein R^{m×m} denotes the set of m × m real matrices and the element dist(d_i, d_j) of the distance matrix is the distance between the dimension pair (d_i, d_j); to obtain the two-dimensional position coordinates of each dimension point, dist²(d_i,·), dist²(·,d_j) and dist²(·,·) are computed:

dist²(d_i,·) = (1/m) Σ_{j=1..m} dist²(d_i, d_j)

dist²(·,d_j) = (1/m) Σ_{i=1..m} dist²(d_i, d_j)

dist²(·,·) = (1/m²) Σ_{i=1..m} Σ_{j=1..m} dist²(d_i, d_j)

wherein "·" indicates that the index i or j of d_i or d_j runs over all values, i.e., from 1 to m.

Then the inner product matrix B = (b_ij)_{m×m} is obtained, whose elements are calculated as follows:

b_ij = -(1/2) (dist²(d_i, d_j) - dist²(d_i,·) - dist²(·,d_j) + dist²(·,·))

Eigenvalue decomposition is performed on the inner product matrix B, B = V Λ V^T, wherein Λ is the diagonal matrix formed by the eigenvalues and V is the eigenvector matrix; since two-dimensional position coordinates are required here, the diagonal matrix Λ* = diag(λ₁, λ₂) formed by the two largest eigenvalues and the corresponding eigenvector matrix V* = (v₁, v₂) ∈ R^{m×2} are taken, and the matrix

Z = (Λ*)^{1/2} (V*)^T ∈ R^{2×m}

satisfies B ≈ Z^T Z; the matrix Z is the representation of the dimension points in two-dimensional space, each column of Z being the two-dimensional coordinate of one dimension point;

the two-dimensional coordinates are plotted as a scatter diagram, which gives the final Isomap algorithm layout. In traditional layout methods, dimensions whose vertices lie close together on the scatter diagram may not actually be as close as the diagram suggests, i.e., the correlation between the corresponding dimensions is not as strong as it appears; the Isomap layout addresses this problem and thereby reduces the influence of such distance errors on the subsequent steps.
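A minimal pure-Python sketch of the classical MDS step is given below for illustration. It assumes the standard double-centering construction of B and, as a simplification not taken from the disclosure, extracts the two leading eigenpairs by power iteration with deflation, which presumes that the two eigenvalues of largest magnitude are positive; a numerical library eigen-solver would normally be used instead. The function names are hypothetical, and the coordinates are returned one row per dimension point.

```python
import math


def double_center(L):
    """Inner-product matrix B obtained from a distance matrix L."""
    m = len(L)
    sq = [[L[i][j] ** 2 for j in range(m)] for i in range(m)]
    row = [sum(sq[i]) / m for i in range(m)]
    col = [sum(sq[i][j] for i in range(m)) / m for j in range(m)]
    total = sum(row) / m
    return [[-0.5 * (sq[i][j] - row[i] - col[j] + total) for j in range(m)]
            for i in range(m)]


def top_eigenpair(B, iterations=500):
    """Dominant eigenpair of a symmetric matrix by power iteration."""
    m = len(B)
    v = [1.0 + 0.1 * i for i in range(m)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:
            return 0.0, [0.0] * m
        v = [x / norm for x in w]
    Bv = [sum(B[i][j] * v[j] for j in range(m)) for i in range(m)]
    lam = sum(v[i] * Bv[i] for i in range(m))  # Rayleigh quotient
    return lam, v


def mds_2d(L):
    """Two-dimensional coordinates, one [x, y] row per dimension point."""
    B = double_center(L)
    m = len(B)
    coords = [[0.0, 0.0] for _ in range(m)]
    for axis in range(2):
        lam, v = top_eigenpair(B)
        scale = math.sqrt(max(lam, 0.0))
        for i in range(m):
            coords[i][axis] = scale * v[i]
        # Deflation: remove the component just found before the next pass.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    return coords


# Hypothetical distance matrix of a 3-4-5 right triangle.
L = [[0.0, 3.0, 4.0],
     [3.0, 0.0, 5.0],
     [4.0, 5.0, 0.0]]
Z = mds_2d(L)
```

Because the toy distances are exactly realizable in the plane, the pairwise distances between the rows of `Z` reproduce `L` up to rounding; for geodesic distance matrices the embedding is only approximate, as B ≈ Z^T Z indicates.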

Further, the dimension subset division in step S2 comprises the following steps:

step S2.1, constructing an undirected graph;

step S2.2, performing maximal clique detection by using the Bron-Kerbosch algorithm.

Still further, the method for constructing the undirected graph in step S2.1 is as follows:

based on the two-dimensional scatter diagram and the distance matrix, a threshold is defined according to requirements; if the distance dist(d_i, d_j) between two dimensions is less than the user-defined threshold, the two dimensions are connected by an edge, otherwise they are left unconnected, so that one or more undirected graphs are eventually formed. Therefore, the larger the threshold set by the user, the more connections are generated and the more nodes the resulting undirected graphs contain. Conversely, the smaller the threshold, the fewer connections are generated and the fewer nodes the resulting undirected graphs contain, but the correlation between those nodes is necessarily higher.
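The thresholding rule can be illustrated with the short sketch below. It is only an example under assumptions: the distance matrix values are invented, isolated dimension points are reported as single-node components, and the function names are hypothetical.

```python
def threshold_graph(L, threshold):
    """Connect dimensions i and j whenever dist(d_i, d_j) < threshold."""
    m = len(L)
    adj = {i: set() for i in range(m)}
    for i in range(m):
        for j in range(i + 1, m):
            if L[i][j] < threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def connected_components(adj):
    """Split the thresholded graph into its separate undirected graphs."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components


# Hypothetical symmetric distance matrix for four dimension points.
L = [[0, 1, 5, 6],
     [1, 0, 2, 7],
     [5, 2, 0, 8],
     [6, 7, 8, 0]]
adj = threshold_graph(L, threshold=3)
components = connected_components(adj)
```

Raising `threshold` to 6 in this toy case would add the edge between points 0 and 2, illustrating how a larger threshold produces more connections.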

Still further, the method in step S2.2 of performing maximal clique detection by using the Bron-Kerbosch algorithm comprises the following steps:

step S2.2.1, taking one of the undirected graphs and, for the nodes v in the undirected graph, defining four sets R, P, X and N(v), wherein:

R: the set of nodes already in the clique, initially the empty set;

P: the set of nodes that may still join the clique, initially the full node set;

X: the set of nodes excluded from consideration, initially the empty set;

N(v): the set of all nodes adjacent to node v;

step S2.2.2, first selecting a node v from the set P and searching for the maximal cliques containing v: v is placed in the set R, and the nodes not in N(v) are removed from the set P and the set X; a node is then selected from the remaining set P, and the operation of step S2.2.2 is repeated until P becomes the empty set; if the set X is then also empty, the set R is a new maximal clique; if the set X is not empty, the set R is a subset of a maximal clique that has already been found;

step S2.2.3, backtracking to the previously selected node, restoring the sets R, P and X to their state before that selection, removing the node selected this time from the set P and adding it to the set X, selecting the next node from the set P, and repeating the operations of steps S2.2.2 and S2.2.3 until all maximal cliques have been found; the maximal cliques obtained are the subsets into which the dimensions are divided.

Because maximal cliques are selected, the dimensions corresponding to the vertices of each clique are all mutually related within the threshold given by the user. This largely guarantees that the coordinate axes corresponding to these dimensions can be arranged freely when the parallel coordinate graph is constructed later.
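For illustration, the recursion of steps S2.2.1 to S2.2.3 corresponds to the basic Bron-Kerbosch algorithm without pivoting, which can be sketched as below. The adjacency data and the function name are assumptions for this example; backtracking is handled implicitly because each recursive call receives fresh copies of R, P and X.

```python
def bron_kerbosch(adj, R=None, P=None, X=None):
    """Yield every maximal clique of the undirected graph `adj` (no pivoting)."""
    if R is None:
        R, P, X = set(), set(adj), set()
    if not P and not X:
        yield set(R)  # nothing can extend R and R is not inside an earlier clique
        return
    for v in list(P):
        # Keep only neighbors of v as candidates (P) and as excluded nodes (X).
        yield from bron_kerbosch(adj, R | {v}, P & adj[v], X & adj[v])
        P = P - {v}   # v has been fully explored ...
        X = X | {v}   # ... so it moves to the excluded set


# Hypothetical undirected graph: a triangle 0-1-2 with a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = sorted(sorted(c) for c in bron_kerbosch(adj))
```

Node 2 belongs to both resulting cliques, which mirrors the statement in the text that one dimension may appear in several dimension subsets.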

Further, the dimension ordering and parallel coordinate graph construction in step S3 comprises the following steps:

step S3.1, for each of the dimension subsets divided in step S2, sorting the dimensions according to a greedy strategy;

step S3.2, constructing a parallel coordinate graph according to the dimension order, and coloring the polylines to make the visualization result clearer.

Still further, the greedy sorting in step S3.1 proceeds as follows: for each divided dimension subset, first, according to the distance matrix, the pair of dimensions with the minimum distance in the subset is selected, and the two coordinate axes corresponding to this pair are placed at the first and second positions from the left of the parallel coordinate graph; secondly, taking the second axis from the left as the reference, the dimension closest to the dimension point of the reference axis is selected from the remaining unarranged dimensions, and its coordinate axis is placed at the third position from the left; and so on, until all dimensions in the subset have been arranged; the dimension sequence finally obtained is the order of the coordinate axes of the constructed low-dimensional parallel coordinate graph;

step S3.2 constructs the parallel coordinate graph according to the dimension order and colors the polylines as follows: the coordinate axes of the low-dimensional parallel coordinate graph to be constructed are ordered according to the dimension sequence, and the parallel coordinate graph is drawn from the dimension data; the polyline color is defined according to the category selected by the user; if the data set has no classification dimension, the data samples are divided into groups using any of several clustering methods and colors are assigned accordingly, the number of groups being specified by the user.

Coloring the polylines helps the user distinguish the polylines in the parallel coordinate graph, improves the graph's ability to convey information, and reduces the clutter of overlapping lines that arises when the data volume is too large; it also improves the appearance of the parallel coordinate graph and increases the user's interest.
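The greedy ordering of step S3.1 can be sketched as follows. This is an illustrative reading, not a definitive implementation: it assumes that after the first pair is placed, each next axis is the unplaced dimension closest to the most recently placed one, and that ties are broken by the smaller index. The matrix values and the name `greedy_order` are hypothetical.

```python
def greedy_order(subset, L):
    """Order the dimensions of one subset by chaining nearest neighbors."""
    dims = sorted(subset)
    if len(dims) < 2:
        return dims
    # Closest pair in the subset occupies the two leftmost axes.
    pairs = [(L[a][b], a, b) for i, a in enumerate(dims) for b in dims[i + 1:]]
    _, first, second = min(pairs)
    order = [first, second]
    remaining = set(dims) - {first, second}
    while remaining:
        last = order[-1]  # reference axis: the one placed most recently
        nxt = min(remaining, key=lambda d: (L[last][d], d))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical symmetric distance matrix for four dimensions.
L = [[0, 4, 1, 6],
     [4, 0, 2, 3],
     [1, 2, 0, 5],
     [6, 3, 5, 0]]
order = greedy_order({0, 1, 2, 3}, L)
```

Here dimensions 0 and 2 are closest and open the sequence, dimension 1 is nearest to 2 and comes third, and dimension 3 is placed last.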

Compared with the prior art, the invention has the following advantages:

The invention provides a "dimension set selection" scheme that groups similar or highly related dimensions into the same set. According to the method, the dimensions are laid out according to the correlation among them, highly correlated dimensions are connected to generate a dimension graph, the maximal cliques of the dimension graph are extracted as sets of highly correlated dimensions, and a plurality of low-dimensional parallel coordinate graphs are produced.

The invention is based on the isometric feature mapping (Isomap) method, which measures a long distance as a chain of many short distances, so that the dimensions are laid out more faithfully and low-dimensional parallel coordinate graphs can be constructed. The resulting visualization reduces errors caused by distance-projection distortion, uses the display space effectively while retaining as much of the useful information in the original data as possible, and presents a result from which information about related dimensions is easy to extract and read.

Drawings

FIG. 1 is a graph of Isomap layout results with a threshold of 3;

FIG. 2 is an enlarged result plot at ① in FIG. 1 of an undirected graph formed in accordance with the threshold values;

FIG. 3 is an enlarged result plot at ② in FIG. 1 of an undirected graph formed in accordance with the threshold values;

FIG. 4 is a diagram of the visualization of the image segmentation data set at ① in FIG. 1;

FIG. 5 is a diagram of the visualization of the image segmentation data set at ② in FIG. 1;

FIG. 6 is a diagram of the results of an Isomap layout;

FIG. 7 is a graph of MDS layout results;

FIG. 8 is a parallel coordinate graph constructed from undirected graph A;

FIG. 9 is a parallel coordinate graph constructed from undirected graph B;

FIG. 10 is a parallel coordinate graph constructed from undirected graph C.

Detailed Description

The following examples implement the layout and division of dimensions using Python 3.6 (64-bit) and the final construction of the parallel coordinate graphs using RStudio; the whole process is executed on Windows 7 (64-bit).

Example 1

Step S1.1, preprocessing the data;

the data are cleaned, missing values are filled in, a data set is established, and the values of each dimension of the data set are regarded as a vector;

D = {a₁, a₂, …, a_n}

a_i = {v_i1, v_i2, …, v_im}

wherein v_ij represents the value of the j-th dimension of the i-th item;

regarding each numerical dimension as a vector, we obtain:

D = {d₁, d₂, …, d_m}

d_j = {v_1j, v_2j, …, v_nj}

wherein d_j represents the j-th dimension and n is the number of samples.

Step S1.2.1, setting a neighborhood parameter k, i.e., the distance between each dimension point d_j and its k nearest dimension points is set to the Euclidean distance; taking the dimension point d₁ as the starting point, let the set S = {d₁} and let the set U contain all remaining vertices, the distance <d₁,u> being recorded as dist(d₁,u) if u is a neighboring dimension point of d₁ and as ∞ otherwise;

step S1.2.2, selecting from the k neighboring dimension points of d₁ the dimension point d₂ with the minimum distance, removing d₂ from the set U and adding it to the set S; then, taking d₂ as the newly considered intermediate point, updating the distances of the dimension points in the set U: if the distance from the starting point d₁ to a dimension point u through d₂ is shorter than the distance that does not pass through d₂, the distance value of u is updated to dist(d₁,u) = <d₁,d₂> + <d₂,u>;

step S1.2.3, repeating step S1.2.2 until all dimension points are contained in the set S, which gives the distances from d₁ to all dimension points; finally, the distance dist(d_i, d_j) between each pair (d_i, d_j) of numerical dimensions is calculated and the distance matrix L is generated, denoted L = (dist(d_i, d_j))_{m×m}.

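Step S1.3, inputting the obtained distance matrix into the MDS algorithm to obtain the two-dimensional scatter diagram layout of the dimension points;

the distance matrix is L = (dist(d_i, d_j))_{m×m} ∈ R^{m×m}, wherein R^{m×m} denotes the set of m × m real matrices and the element dist(d_i, d_j) of the distance matrix is the distance between the dimension pair (d_i, d_j); to obtain the two-dimensional position coordinates of each dimension point, dist²(d_i,·), dist²(·,d_j) and dist²(·,·) are computed:

dist²(d_i,·) = (1/m) Σ_{j=1..m} dist²(d_i, d_j)

dist²(·,d_j) = (1/m) Σ_{i=1..m} dist²(d_i, d_j)

dist²(·,·) = (1/m²) Σ_{i=1..m} Σ_{j=1..m} dist²(d_i, d_j)

wherein "·" indicates that the index i or j of d_i or d_j runs over all values, i.e., from 1 to m.

The inner product matrix B = (b_ij)_{m×m} with b_ij = -(1/2) (dist²(d_i, d_j) - dist²(d_i,·) - dist²(·,d_j) + dist²(·,·)) is then decomposed as B = V Λ V^T; taking the diagonal matrix Λ* = diag(λ₁, λ₂) of the two largest eigenvalues and the corresponding eigenvector matrix V* = (v₁, v₂), the matrix Z = (Λ*)^{1/2} (V*)^T gives the two-dimensional coordinates of the dimension points, one per column;

the two-dimensional coordinates are plotted as a scatter diagram, which gives the final Isomap algorithm layout.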

Step S2, dividing the dimensions into subsets;

step S2.1, constructing an undirected graph;

based on the two-dimensional scatter diagram and the distance matrix, a threshold is defined according to requirements; if the distance dist(d_i, d_j) between two dimensions is less than the user-defined threshold, the two dimensions are connected by an edge, otherwise they are left unconnected. One or more undirected graphs are finally formed.

Step S2.2, performing maximal clique detection by using the Bron-Kerbosch algorithm.

Step S2.2.1, taking one of the undirected graphs and, for the nodes v in the undirected graph, defining four sets R, P, X and N(v), wherein:

R: the set of nodes already in the clique, initially the empty set;

P: the set of nodes that may still join the clique, initially the full node set;

X: the set of nodes excluded from consideration, initially the empty set;

N(v): the set of all nodes adjacent to node v;

step S2.2.2, first selecting a node v from the set P and searching for the maximal cliques containing v: v is placed in the set R, and the nodes not in N(v) are removed from the set P and the set X; a node is then selected from the remaining set P, and the operation of step S2.2.2 is repeated until P becomes the empty set;

step S2.2.3, if the set X is then also empty, the set R is a new maximal clique; if the set X is not empty, the set R is a subset of a maximal clique that has already been found; then, backtracking to the previously selected node, restoring the sets R, P and X to their state before that selection, removing the node selected this time from the set P and adding it to the set X, selecting the next node from the set P, and repeating the operations of steps S2.2.2 and S2.2.3 until all maximal cliques have been found; the maximal cliques obtained are the subsets into which the dimensions are divided.

Step S3.1, for each of the dimension subsets divided in step S2, sorting the dimensions according to a greedy strategy;

for each divided dimension subset, first, according to the distance matrix, the pair of dimensions with the minimum distance in the subset is selected, and the two coordinate axes corresponding to this pair are placed at the first and second positions from the left of the parallel coordinate graph; secondly, taking the second axis from the left as the reference, the dimension closest to the dimension point of the reference axis is selected from the remaining unarranged dimensions, and its coordinate axis is placed at the third position from the left; and so on, until all dimensions in the subset have been arranged; the dimension sequence finally obtained is the order of the coordinate axes of the constructed low-dimensional parallel coordinate graph;

step S3.2, constructing a parallel coordinate graph according to the dimension order, and coloring the polylines to make the visualization result clearer;

the coordinate axes of the low-dimensional parallel coordinate graph to be constructed are ordered according to the dimension sequence, and the parallel coordinate graph is drawn from the dimension data; the polyline color is defined according to the category selected by the user; if the data set has no classification dimension, the data samples are divided into groups using any of several clustering methods and colors are assigned accordingly, the number of groups being specified by the user.

Example 2

The data of this embodiment come from the image segmentation data set in the UCI machine learning repository. The data were drawn from 7 different outdoor pictures, each manually divided into 30 regions, and 20 measures were computed for each region, giving a data set with 20 feature values for each of 210 image regions. In this embodiment, each picture is taken as a class, each feature as a dimension, and each region as a sample, forming a data set with 20 dimensions and 210 samples.

The data are normalized dimension by dimension, the 20 dimensions of the data set are regarded as 20 vectors and laid out on a two-dimensional plane using the Isomap algorithm, and a threshold is selected as required.
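The text does not state which normalization is applied. As one plausible illustration only, the sketch below assumes min-max scaling of each dimension to the interval [0, 1]; the function name and sample values are hypothetical.

```python
def min_max_normalize(dim):
    """Scale one dimension vector to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(dim), max(dim)
    if hi == lo:
        # A constant dimension carries no spread; map it to zeros.
        return [0.0 for _ in dim]
    return [(v - lo) / (hi - lo) for v in dim]


# Hypothetical dimension vector.
normalized = min_max_normalize([2.0, 4.0, 6.0])
```

Normalizing each dimension separately keeps dimensions with large numeric ranges from dominating the Euclidean distances used in the neighbor search.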

For clarity of the layout results, the dimensions are labeled X1 through X20. The data layout and dimension set selection results are shown in FIG. 1. As can be seen from FIG. 1, two undirected graphs are formed in the layout result of the data set, which shows that the dimensions of the data set are divided into two subsets under the selected threshold; FIG. 2 and FIG. 3 are enlarged views of the two undirected graphs. From the undirected graphs formed by the dimension points shown in FIG. 2 and FIG. 3, the maximal cliques are selected using the Bron-Kerbosch algorithm, and the dimensions corresponding to the points of each maximal clique are arranged according to their pairwise correlation using the greedy strategy, which gives the dimension order of the parallel coordinate graph.

The low-dimensional parallel coordinate graphs formed from the divided subsets are shown in FIG. 4 and FIG. 5; the labels of the coordinate axes in the figures have the following meanings:

X6: the result of a line extraction algorithm that counts the number of lines of length 5 (in any direction) with a lower contrast, greater than 5;

X7: the contrast of horizontally adjacent pixels within the region (mean);

X8: the contrast of horizontally adjacent pixels within the region (standard deviation);

X9: the contrast of vertically adjacent pixels within the region (mean);

X10: the contrast of vertically adjacent pixels within the region (standard deviation);

X11: the average of (R + G + B)/3 over the region;

X12: the average R value over the region;

X13: the average B value over the region;

X14: the average G value over the region;

X16: the excess blue measure (2B - (G + R));

X18: a three-dimensional non-linear transformation of the RGB values using the algorithm of Foley and Van Dam.

The 5 dimensions contained in the parallel coordinate graph of FIG. 4 are essentially measurements of the contrast of neighboring pixels; the contrast of horizontally adjacent pixels can be regarded as a detector of vertical edges and, likewise, the contrast of vertically adjacent pixels as a detector of horizontal edges, so these dimensions are correlated. For example, looking at the two dimensions X7 and X9 in FIG. 4, the relationship between them is a weak negative correlation.

The 6 dimensions contained in FIG. 5 are related to image color; they are different calculations on the R, G and B values of the region, so a strong correlation between them can be expected. FIG. 5 confirms this expectation. The three groups of polylines represent samples from three different images; since the proportion of each color differs between images, three distinct classes appear in the parallel coordinate graph, labeled group I, group II and group III. As can be seen from FIG. 5, all three groups of polylines show a strong positive correlation across the five dimensions on the left. For the two dimensions on the right, the correlation differs between categories: group I shows a strong negative correlation between the two dimensions, group III a weak negative correlation, and group II a strong positive correlation. This difference is related to the distribution of the original picture categories from which the samples are drawn.

Example 3

The data of this embodiment come from the medical field: medical data informatics is an increasingly important research area in medicine, and visualization can improve the verifiability of data by showing the combinations of relevant dimensions that correspond to particular clinical outcomes. The data are from the UCI data set on early-stage chronic kidney disease, collected at Apollo Hospital, India, with 400 samples in total, of which 250 are patients and 150 are non-patients. The data set includes 24 index features: 13 categorical variables and 11 numerical variables. Preprocessing is completed once the missing values in the data set have been filled in, finally giving a data set with 24 dimensions and 400 samples.

The Isomap algorithm builds on MDS and uses a shortest-path algorithm to replace the Euclidean distance between dimensions with a long distance composed of short distances, i.e., the geodesic distance. Therefore, for the same threshold, a maximal clique formed from the MDS layout may contain more dimension points than one formed from the Isomap layout, but the nearby dimension points selected in this way do not necessarily represent strongly correlated dimensions.

FIG. 6 and FIG. 7 show the layout results obtained for the data set by the Isomap algorithm and the MDS algorithm, respectively. For clarity of the layout results, the dimensions are labeled X0 to X23, and the analysis here focuses on the undirected graph formed in the upper right corner. As shown in FIG. 6 and FIG. 7, in both layouts the undirected graph C includes the same five dimension points, namely X2, X12, X14, X15 and X17. However, because the two methods compute different dimension distances, the undirected graph structures differ for the same threshold, and so do the maximal cliques selected by the Bron-Kerbosch algorithm and the dimension orders obtained by the greedy sorting.

TABLE 4.1 layout and subset dimension order

As can be seen from Table 4.1, the undirected graph obtained from the MDS layout yields a single maximal clique, forming one dimension subset that contains all five dimensions. The undirected graph obtained from the Isomap layout yields three maximal cliques, forming three dimension subsets of three dimensions each, with X14 and X15 appearing in all three subsets.

TABLE 4.2 matrix of correlation coefficients

The correlation between the five dimensions is compared using the correlation coefficient matrix. As can be seen from Table 4.2, the correlation coefficient of X14 and X15 is the highest among the five dimensions, at 0.857, followed by that of X14 and X17, which is consistent with the subset dimension order obtained. In addition, the correlation coefficient between X2 and X17 is 0.660, and that between X2 and X15 is 0.684; the correlation coefficient between X12 and X2 is 0.538, while that between X12 and X14 is 0.581. By comparison, the dimensions within the three subsets obtained from the Isomap layout are more strongly correlated than those within the single subset obtained from the MDS layout. That is, for the same threshold, the Isomap algorithm gives a better layout than the MDS algorithm.
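The text does not name the correlation coefficient used in Table 4.2. As an illustration under the assumption that it is the Pearson coefficient, a correlation coefficient matrix over dimension vectors could be computed as sketched below; the function names and sample vectors are hypothetical and unrelated to the reported values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sx * sy)


def correlation_matrix(dims):
    """Matrix of pairwise Pearson coefficients between dimension vectors."""
    return [[pearson(a, b) for b in dims] for a in dims]


# Hypothetical dimension vectors.
dims = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
C = correlation_matrix(dims)
```

Comparing the entries of such a matrix within each dimension subset is one way to check whether a layout groups strongly correlated dimensions, which is the comparison the paragraph above carries out.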

With a threshold of 3, all parallel coordinate graphs were constructed from the Isomap layout. FIG. 8, FIG. 9 and FIG. 10 correspond to the undirected graphs A, B and C in the layout, respectively. To save space, if two parallel coordinate graphs share a dimension at their leftmost or rightmost axis, they are combined into one parallel coordinate graph. In the parallel coordinate graphs, the diseased and non-diseased samples are clearly separated into two categories: the lines labeled I show the distribution of the non-diseased samples and the lines labeled II show the distribution of the diseased samples. In every graph the non-diseased samples are more concentrated and the diseased samples are more dispersed, which is consistent with the fact that the body indexes of healthy people lie within a prescribed range.
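The merging rule for graphs that share an end axis can be sketched as follows. This is a hedged illustration only: it assumes that an axis order may be reversed before joining (reversal keeps the same neighboring axis pairs) and that the two orders share no dimension other than the joining one. The axis names and the function `merge_orders` are hypothetical.

```python
def merge_orders(a, b):
    """Join two axis orders that share an end dimension; return None otherwise."""
    if not a or not b:
        return None
    for left in (a, a[::-1]):
        for right in (b, b[::-1]):
            shares_end = left[-1] == right[0]
            # Refuse to merge if any other dimension would appear twice.
            no_other_overlap = not set(left) & set(right[1:])
            if shares_end and no_other_overlap:
                return left + right[1:]
    return None


# Hypothetical axis orders of two parallel coordinate graphs.
merged = merge_orders(["A", "B", "C"], ["D", "C"])
```

In this example the second order is reversed so that the shared axis C sits at the junction, and the merged graph shows each dimension once.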

Looking at the details of the parallel coordinate graphs: in group A, X9, X10 and X11 form a group of strongly correlated dimensions. The graph shows that, for the group II polylines, the relationship between X9 and X10 is a weak negative correlation and that between X10 and X11 is a positive correlation. Here X9 is random blood glucose, X10 is blood urea, and X11 is serum creatinine. Creatinine is a small-molecule substance and, like urea, is excreted in urine after glomerular filtration, so X10 and X11 are positively correlated. Renal failure often leads to decreased renal filtration and increased blood urea and serum creatinine levels, so in these dimensions the group II lines lie substantially above those of group I.

In group C, the three dimensions X14, X15 and X17 are positively correlated, and all three relate to red blood cells: X14 is the hemoglobin level, X15 is the hematocrit, i.e., the proportion of blood volume occupied by red blood cells, and X17 is the red blood cell count. Since patients with renal failure have a reduced capacity to produce heme, the group II polylines representing patients lie below the group I polylines, and the positive correlation between the three dimensions is consistent with medical knowledge.

The invention uses two different data sets to demonstrate the feasibility of dividing the dimensions into subsets and constructing several low-dimensional parallel coordinate graphs, each consisting of related dimensions. The results of the Isomap layout and the MDS layout are also compared; with the dimension correlation coefficient as the index, the Isomap algorithm yields a better layout than the MDS algorithm under the same threshold. In addition, the experimental results shown are analyzed and explained, indicating that the results obtained from the experimental data are consistent with the actual situation.

Those skilled in the art will appreciate that the invention may be practiced without these specific details. Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, the present invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the present invention as defined by the appended claims, everything that makes use of the inventive concept is protected.

Claims

1. A method for constructing a low-dimensional parallel coordinate graph based on Isomap algorithm layout, characterized by comprising the following steps:

step S2, dimension subset division;

2. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 1, wherein: in step S1, the Isomap algorithm is used to perform the dimension correlation calculation and layout, further comprising the following steps:

step S1.1, preprocessing the data;

3. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 2, wherein: in step S1.1, the data are preprocessed by cleaning the data, filling missing values, establishing a data set, and taking the dimension values of the data set as vectors;

D = {a_1, a_2, …, a_n}

a_i = {v_i1, v_i2, …, v_im}

wherein v_ij represents the j-th dimension value of the i-th item;

considering each numerical dimension as a vector, we obtain:

D = {d_1, d_2, …, d_m}

d_j = {v_1j, v_2j, …, v_nj}

wherein d_j represents the j-th dimension, and n is the number of samples.
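
The change of view from items a_i to dimension vectors d_j can be sketched as follows. This is a minimal illustration only: the function name `to_dimension_vectors` is hypothetical, and filling a missing value with the mean of its dimension is an assumption, since the claim does not say how missing values are filled.

```python
from statistics import mean


def to_dimension_vectors(items):
    """Turn n items (rows of m values) into m dimension vectors of length n.

    Missing values are given as None and are filled with the mean of the
    values present in the same dimension (an assumption of this sketch).
    """
    m = len(items[0])
    dims = []
    for j in range(m):
        column = [row[j] for row in items]
        present = [v for v in column if v is not None]
        fill = mean(present) if present else 0.0
        dims.append([fill if v is None else float(v) for v in column])
    return dims


# Three items (n = 3) with two dimensions (m = 2); one value is missing.
items = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
dims = to_dimension_vectors(items)
```

Each entry of `dims` then plays the role of one d_j, so later distance calculations operate on dimensions rather than on items.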

4. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 2, wherein: in step S1.2, the distance dist(d_i, d_j) between each pair of numerical dimensions (d_i, d_j) is calculated and the distance matrix L is generated, further comprising the following steps:

taking the dimension point d_1 as the starting point, the set S = {d_1} and the set U contains the vertices other than d_1, i.e., U = {the remaining vertices}; if a dimension point u in the set U is a neighboring dimension point of d_1, the distance of <d_1, u> is denoted dist(d_1, u); if u is not a neighboring dimension point of d_1, the distance of <d_1, u> is recorded as ∞;
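
The initialization above, with ∞ for non-neighboring dimension points, matches the starting state of Dijkstra's shortest-path algorithm. A minimal sketch under that assumption follows. The function name `geodesic_distances` and the example matrix are hypothetical, and a priority queue stands in for the explicit sets S and U.

```python
import heapq
from math import inf


def geodesic_distances(neighbor_dist):
    """Shortest-path distances between all dimension points.

    neighbor_dist[i][j] is dist(d_i, d_j) if d_j is a neighbor of d_i
    and inf otherwise. Dijkstra's algorithm is run once from every point.
    """
    m = len(neighbor_dist)
    result = []
    for source in range(m):
        dist = [inf] * m
        dist[source] = 0.0
        heap = [(0.0, source)]  # points still to be settled (the role of U)
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale entry, u was already settled
            for v in range(m):
                w = neighbor_dist[u][v]
                if v != u and w != inf and d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
        result.append(dist)
    return result


# d_1 and d_3 are not neighbors, so their distance goes through d_2.
graph = [
    [0.0, 1.0, inf],
    [1.0, 0.0, 2.0],
    [inf, 2.0, 0.0],
]
L = geodesic_distances(graph)
```

This is how a long distance is measured by a chain of short ones: the entry for the non-neighboring pair is the sum of the two short hops.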

5. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 2, wherein: in step S1.3, the obtained distance matrix is input into the MDS algorithm, and the two-dimensional scatter plot layout of the dimension points is obtained as follows:

distance matrix

wherein

wherein the dot (·) indicates that the index i or j in d_i or d_j takes all values, i.e., from 1 to m;

performing eigenvalue decomposition on the inner product matrix B, B = VΛV^T, wherein Λ is the diagonal matrix formed by the eigenvalues and V is the eigenvector matrix; since two-dimensional position coordinates are required here, the diagonal matrix formed by the two largest eigenvalues is taken from Λ

and the corresponding eigenvector matrix

matrix
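
The eigenvalue-decomposition step can be sketched with classical MDS. The formulas of this claim are not reproduced in the text, so the double-centering form B = -1/2 · J · L² · J, with J the centering matrix and L² the element-wise square, is assumed here as the standard textbook construction and is not taken from the claim. The function name `classical_mds` is hypothetical, and NumPy is a third-party dependency.

```python
import numpy as np


def classical_mds(L, k=2):
    """Embed m points in k dimensions from an m x m distance matrix L."""
    L = np.asarray(L, dtype=float)
    m = L.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    B = -0.5 * J @ (L ** 2) @ J              # assumed inner product matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    lam = np.clip(eigvals[idx], 0.0, None)   # guard against tiny negatives
    return eigvecs[:, idx] * np.sqrt(lam)    # one row of coordinates per point


# Three points on a line at positions 0, 1 and 3.
L_example = np.array([[0.0, 1.0, 3.0],
                      [1.0, 0.0, 2.0],
                      [3.0, 2.0, 0.0]])
coords = classical_mds(L_example)
```

Each row of `coords` is the two-dimensional position of one dimension point on the scatter plot. For distances that are exactly Euclidean, as in this example, the pairwise distances of the rows reproduce the input matrix.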

6. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 1, wherein: the dimension subset division in step S2 further includes the following steps:

step S2.1, constructing an undirected graph;

step S2.2, performing maximum clique detection using the Bron-Kerbosch algorithm.

7. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 6, wherein: the undirected graph in step S2.1 is constructed as follows: based on the two-dimensional scatter plot and the distance matrix, a threshold is defined by the user as required; if the distance dist(d_i, d_j) between two dimensions is smaller than the user-defined threshold, the two dimensions are connected, otherwise they are left as they are; one or more undirected graphs are finally formed on the scatter plot.
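
The thresholding rule can be sketched as an adjacency-set construction. The function name `build_undirected_graph` and the example distances are hypothetical, and dimensions are referred to by index.

```python
def build_undirected_graph(dist, threshold):
    """Connect dimensions i and j whenever dist[i][j] < threshold."""
    m = len(dist)
    adj = {i: set() for i in range(m)}
    for i in range(m):
        for j in range(i + 1, m):
            if dist[i][j] < threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


dist = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 2.0],
    [5.0, 2.0, 0.0],
]
adj = build_undirected_graph(dist, threshold=3)
```

With a threshold of 3, dimensions 0 and 2 stay unconnected because their distance is 5, while both are connected to dimension 1.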

8. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 6, wherein: the maximum clique detection using the Bron-Kerbosch algorithm in step S2.2 further comprises the following steps:

step S2.2.1, taking one of the undirected graphs and defining four sets R, P, X and N(v) for a node v in the undirected graph, wherein R is the set of nodes already in the clique, initially empty; P is the set of nodes that may still join the clique, initially the full node set; X is the set of nodes no longer considered, initially empty; and N(v) is the set of all nodes adjacent to node v;

step S2.2.2, first selecting a node v from the set P and searching for a maximal clique containing v: placing v in the set R and removing the nodes not in N(v) from the sets P and X; then selecting a node from the remaining set P and repeating the operation of step S2.2.2 until P becomes an empty set; at this point, if the set X is also empty, the set R is a new maximal clique; if the set X is not empty, the set R is a subset of a maximal clique that has already been found;
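
The R, P, X bookkeeping described above corresponds to the basic Bron-Kerbosch recursion without pivoting, sketched below. The function names and the example graph are hypothetical, and `adj[v]` plays the role of N(v).

```python
def bron_kerbosch(R, P, X, adj, cliques):
    """Collect every maximal clique that extends R using nodes from P."""
    if not P and not X:
        cliques.append(sorted(R))  # nothing left to add: R is maximal
        return
    for v in list(P):
        # Keep only neighbors of v in P and X, then recurse with v in R.
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P = P - {v}  # v has been fully explored
        X = X | {v}  # cliques containing v were already reported


def maximal_cliques(adj):
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return cliques


# A triangle 0-1-2 with one extra edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = sorted(maximal_cliques(adj))
```

When P is empty but X is not, the branch ends without reporting R, which is the case where R is only a subset of a clique found earlier.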

9. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 1, wherein: arranging the dimensions in order and constructing the parallel coordinate graph in step S3 further comprises the following steps:

10. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 9, wherein: in step S3.1, the dimensions of each dimension subset divided in step S2 are ordered following a greedy strategy as follows: for a divided dimension subset, first, according to the distance matrix, the pair of dimensions with the smallest distance is selected from the subset, and the two coordinate axes corresponding to this pair are placed at the first and second positions from the left of the parallel coordinate graph; second, taking the second coordinate axis from the left as the reference, the dimension closest to the dimension point of the reference axis is selected from the remaining unarranged dimensions, and its coordinate axis is placed at the third position from the left; and so on, until all dimensions in the subset have been arranged; the resulting dimension sequence is the order of the coordinate axes in the constructed low-dimensional parallel coordinate graph;
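
The greedy ordering can be sketched as follows. The function name `greedy_axis_order` and the example distances are hypothetical. The phrase "and so on" is read here as always using the most recently placed axis as the reference, which is an interpretation of this sketch.

```python
def greedy_axis_order(subset, dist):
    """Order the dimensions of one subset for the parallel coordinate axes."""
    subset = list(subset)
    if len(subset) < 2:
        return subset
    # Start with the closest pair of dimensions in the subset.
    first, second = min(
        ((a, b) for i, a in enumerate(subset) for b in subset[i + 1:]),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining = [d for d in subset if d not in (first, second)]
    while remaining:
        ref = order[-1]  # most recently placed axis is the reference
        nxt = min(remaining, key=lambda d: dist[ref][d])
        order.append(nxt)
        remaining.remove(nxt)
    return order


dist = [
    [0, 4, 1, 6],
    [4, 0, 2, 3],
    [1, 2, 0, 5],
    [6, 3, 5, 0],
]
order = greedy_axis_order([0, 1, 2, 3], dist)
```

In this example dimensions 0 and 2 form the closest pair and take the first two positions. Dimension 1 is then nearest to dimension 2, and dimension 3 is placed last.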