CN111488502A - Low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout - Google Patents

Info

Publication number
CN111488502A
Authority
CN
China
Prior art keywords
dimension
dimensional
distance
layout
dimensions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010279193.4A
Other languages
Chinese (zh)
Inventor
牛奉高
赵欣蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Priority to CN202010279193.4A priority Critical patent/CN111488502A/en
Publication of CN111488502A publication Critical patent/CN111488502A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/904Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of visualization and relates in particular to a low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout. The method computes the correlation between every pair of dimensions in the data, lays out the dimensions according to those correlations, divides the dimensions into subsets according to the layout result, and finally orders the dimensions within each subset and constructs a parallel coordinate graph. Because it is based on isometric feature mapping (Isomap), which measures long distances as sums of many short distances, the method produces a better layout of the dimensions, from which low-dimensional parallel coordinate graphs are constructed. The resulting visualization reduces errors caused by distance distortion under projection, uses the display space effectively while retaining as much of the useful information in the original data as possible, and presents a result from which information about related dimensions is easy to extract and read.

Description

Low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout
Technical Field
The invention belongs to the technical field of visualization, and particularly relates to a low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout.
Background
With the advent of the big data era, the amount of data people generate keeps growing, and data is updated ever faster. Beyond the increase in volume, the data itself has also changed: compared with the past, a large share of the data produced by modern information flows is high-dimensional. Because human spatial imagination is limited, it is difficult to obtain information directly from such data. How to process high-dimensional data effectively and how to extract valuable information from it has therefore become a problem to be solved.
Data visualization has become an important approach to this problem. A branch of visualization, it is the process of converting data into visual form and the scientific study of the visual representation of data. Data visualization converts high-dimensional data, which is hard to picture spatially, into graphics that people can observe directly. This conversion lets people quickly grasp the surface information in the data and also draws on human insight, so that the logical relationships hidden beneath large volumes of data are easier to infer. For this reason, visualization of high-dimensional data has become a current research direction for many scientists.
Disclosure of Invention
To address the above problems, the invention provides a low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout.
To achieve this purpose, the invention adopts the following technical scheme:
The method for constructing the low-dimensional parallel coordinate graph based on the Isomap algorithm layout comprises the following steps:
Step S1, performing dimension correlation calculation and layout using the Isomap algorithm;
Step S2, dividing the dimensions into subsets;
Step S3, arranging the dimensions in order and constructing a parallel coordinate graph.
Further, the dimension correlation calculation and layout using the Isomap algorithm in step S1 comprises the following steps:
Step S1.1, preprocessing the data;
Step S1.2, calculating the distance $\mathrm{dist}(d_i, d_j)$ between each dimension pair $(d_i, d_j)$ of the numerical dimensions and generating a distance matrix $L$;
Step S1.3, inputting the obtained distance matrix into the MDS algorithm to obtain a two-dimensional scatter plot layout of the dimension points.
Still further, the data preprocessing in step S1.1 consists of cleaning the data, filling in missing values, establishing the data set, and treating the values of each dimension of the data set as a vector;
if the data set $D$ has $n$ samples, each with $m$ attribute dimensions, the data set $D$ and its $i$-th item $a_i$ are expressed as:
$$D = \{a_1, a_2, \ldots, a_n\}$$
$$a_i = \{v_{i1}, v_{i2}, \ldots, v_{im}\}$$
where $v_{ij}$ denotes the value of the $i$-th item in the $j$-th dimension;
treating each numerical dimension as a vector gives:
$$D = \{d_1, d_2, \ldots, d_m\}$$
$$d_j = \{v_{1j}, v_{2j}, \ldots, v_{nj}\}$$
where $d_j$ denotes the $j$-th dimension and $n$ is the number of samples.
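The change of viewpoint in step S1.1, from sample rows $a_i$ to dimension vectors $d_j$, can be sketched as follows. This is a minimal illustration only: the function name `to_dimension_vectors` and the rule of filling a missing value with the mean of its dimension are assumptions, since the text does not state how missing values are filled.

```python
def to_dimension_vectors(samples):
    """Turn n sample rows (m values each) into m dimension vectors of length n.

    Missing values are given as None and filled with the mean of the observed
    values of that dimension (an assumed fill rule, for illustration only).
    """
    m = len(samples[0])
    dims = []
    for j in range(m):
        column = [row[j] for row in samples]
        observed = [v for v in column if v is not None]
        mean = sum(observed) / len(observed)
        dims.append([mean if v is None else float(v) for v in column])
    return dims


# Three samples a_1..a_3 with three attribute dimensions each.
samples = [
    [1.0, 10.0, 5.0],
    [2.0, None, 6.0],
    [3.0, 30.0, 7.0],
]
dims = to_dimension_vectors(samples)  # dims[j] plays the role of d_(j+1)
```

Each entry of `dims` is one dimension vector, so every later distance is taken between dimensions rather than between samples.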
Still further, in step S1.2 the calculation of the distance $\mathrm{dist}(d_i, d_j)$ between each dimension pair $(d_i, d_j)$ of the numerical dimensions and the generation of the distance matrix $L$ comprises the following steps:
Step S1.2.1, setting a neighborhood parameter $k$, and setting the distance between each dimension point $d_j$ and each of its $k$ nearest dimension points to the Euclidean distance:
$$\mathrm{dist}(d_i, d_j) = \sqrt{\sum_{t=1}^{n} (v_{ti} - v_{tj})^2}$$
taking dimension point $d_1$ as the starting point, let the set $S = \{d_1\}$ and let the set $U$ contain the vertices other than $d_1$, i.e. $U = \{\text{remaining vertices}\}$; if a dimension point $u$ in $U$ is a neighbor of $d_1$, the distance $\langle d_1, u \rangle$ is recorded as $\mathrm{dist}(d_1, u)$; if $u$ is not a neighbor of $d_1$, the distance $\langle d_1, u \rangle$ is recorded as $\infty$;
Step S1.2.2, selecting from the $k$ neighbors of $d_1$ the dimension point $d_2$ with the minimum distance, removing $d_2$ from $U$ and adding it to $S$; taking $d_2$ as the newly considered intermediate point, updating the distances of the dimension points remaining in $U$: if the distance from the starting point $d_1$ to a dimension point $u$ via $d_2$ is shorter than the distance that does not pass through $d_2$, the distance of $u$ is updated to $\mathrm{dist}(d_1, u) = \langle d_1, d_2 \rangle + \langle d_2, u \rangle$;
Step S1.2.3, repeating step S1.2.2 until all dimension points are contained in $S$, which gives the distances from $d_1$ to all dimension points; finally, calculating in this way the distance $\mathrm{dist}(d_i, d_j)$ between every dimension pair $(d_i, d_j)$ of the numerical dimensions and generating the distance matrix $L = (\mathrm{dist}(d_i, d_j))_{m \times m}$.
Compared with a distance matrix computed from the Euclidean distances between dimension pairs alone, the distance matrix $L$ obtained in step S1.2 replaces long distances with estimates of the intrinsic geodesic distance, so the layout computed by the algorithm reflects the strength of the correlation between dimensions more accurately, thereby reducing errors.
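The procedure of step S1.2 amounts to a shortest-path search on a k-nearest-neighbor graph, repeated from every dimension point. The sketch below is one way to write it under stated assumptions: the function names are illustrative, the neighbor graph is made symmetric (an edge is kept if either endpoint lists the other among its $k$ nearest), and a binary heap replaces the explicit sets $S$ and $U$ of the description.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two dimension vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def geodesic_distance_matrix(dims, k):
    """Approximate geodesic distances between all dimension points."""
    m = len(dims)
    # Keep only the edges to the k nearest neighbours (symmetrised: assumption).
    graph = [dict() for _ in range(m)]
    for i in range(m):
        nearest = sorted(
            (euclidean(dims[i], dims[j]), j) for j in range(m) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    # Dijkstra's algorithm from every starting point.
    L = []
    for source in range(m):
        dist = [math.inf] * m
        dist[source] = 0.0
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v, w in graph[u].items():
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
        L.append(dist)
    return L


# Three dimension points forming a 3-4-5 right triangle, with k = 1.
dims = [[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]]
L = geodesic_distance_matrix(dims, k=1)
```

With `k=1` the direct edge between the first and third points is absent, so their distance becomes 3 + 4 = 7 rather than the Euclidean 5. If the neighbor graph is disconnected, the unreachable entries stay at infinity, which a real run would need to handle by increasing `k`.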
Still further, in step S1.3 the obtained distance matrix is input into the MDS algorithm and the two-dimensional scatter plot layout of the dimension points is obtained as follows:
the distance matrix is
$$L = (\mathrm{dist}(d_i, d_j))_{m \times m} \in \mathbb{R}^{m \times m}$$
where $\mathbb{R}^{m \times m}$ denotes the set of $m \times m$ real matrices and the element $\mathrm{dist}(d_i, d_j)$ of the distance matrix is the distance between the dimension pair $(d_i, d_j)$; to obtain the two-dimensional position coordinates of each dimension point, $\mathrm{dist}^2(d_i, \cdot)$, $\mathrm{dist}^2(\cdot, d_j)$ and $\mathrm{dist}^2(\cdot, \cdot)$ are computed first:
$$\mathrm{dist}^2(d_i, \cdot) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{dist}^2(d_i, d_j)$$
$$\mathrm{dist}^2(\cdot, d_j) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{dist}^2(d_i, d_j)$$
$$\mathrm{dist}^2(\cdot, \cdot) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{dist}^2(d_i, d_j)$$
where $\cdot$ indicates that the index $i$ or $j$ of $d_i$ or $d_j$ takes all values from 1 to $m$.
Then the inner product matrix $B = (b_{ij})_{m \times m}$ is obtained, whose elements are calculated as:
$$b_{ij} = -\frac{1}{2} \left( \mathrm{dist}^2(d_i, d_j) - \mathrm{dist}^2(d_i, \cdot) - \mathrm{dist}^2(\cdot, d_j) + \mathrm{dist}^2(\cdot, \cdot) \right)$$
Eigenvalue decomposition is performed on the inner product matrix, $B = V \Lambda V^T$, where $\Lambda$ is the diagonal matrix of eigenvalues and $V$ is the matrix of eigenvectors; since two-dimensional position coordinates are required, the diagonal matrix formed by the two largest eigenvalues is taken,
$$\Lambda_* = \mathrm{diag}(\lambda_1, \lambda_2)$$
together with the matrix of the corresponding eigenvectors
$$V_* = (\mathbf{v}_1, \mathbf{v}_2) \in \mathbb{R}^{m \times 2}$$
The matrix
$$Z = \Lambda_*^{1/2} V_*^T \in \mathbb{R}^{2 \times m}$$
satisfies $B \approx Z^T Z$; $Z$ is the representation of the dimension points in two-dimensional space, each column of $Z$ being the two-dimensional coordinates of one dimension point;
The two-dimensional coordinates are plotted as a scatter plot, which gives the final Isomap algorithm layout. In traditional layout methods, vertices that lie close together on the scatter plot may represent dimensions whose true distance is not as small as the plot shows, that is, the correlation of the corresponding dimensions is not as strong as it appears. The Isomap layout addresses this problem and thereby weakens the influence of distance errors on the subsequent process.
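The MDS step above (double centering of the squared distances, eigendecomposition, and scaling by the square roots of the two largest eigenvalues) can be sketched with NumPy. This is an illustrative sketch, not the implementation used in the embodiments: the function name `classical_mds` is an assumption, NumPy is assumed to be available, and the function returns the coordinates with one row per dimension point, which is convenient for plotting.

```python
import numpy as np


def classical_mds(L, n_components=2):
    """Embed points in n_components dimensions from a distance matrix L."""
    D2 = np.asarray(L, dtype=float) ** 2
    m = D2.shape[0]
    # Double centring: b_ij = -1/2 (d_ij^2 - row mean - column mean + grand mean).
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ D2 @ J
    eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    # Geodesic distances may give small negative eigenvalues; clip them to zero.
    lam = np.clip(eigvals[top], 0.0, None)
    return eigvecs[:, top] * np.sqrt(lam)  # shape (m, n_components)


# Distances of a 3-4-5 right triangle, which embed exactly in the plane.
L = [[0.0, 3.0, 5.0], [3.0, 0.0, 4.0], [5.0, 4.0, 0.0]]
coords = classical_mds(L)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

For this planar example `recovered` reproduces `L` up to rounding. With geodesic input the two-dimensional layout is generally only an approximation of the distances.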
Further, the dimension subset division in step S2 comprises the following steps:
Step S2.1, constructing an undirected graph;
Step S2.2, detecting maximal cliques using the Bron-Kerbosch algorithm.
Still further, the undirected graph in step S2.1 is constructed as follows:
based on the two-dimensional scatter plot and the distance matrix, the user sets a threshold as required; if the distance $\mathrm{dist}(d_i, d_j)$ between two dimensions is less than the user-defined threshold, the two dimensions are connected by an edge, and otherwise they are left unconnected, so that one or more undirected graphs are eventually formed. A larger threshold therefore produces more connections, and the resulting undirected graphs contain more nodes. Conversely, a smaller threshold produces fewer connections and fewer nodes in the resulting undirected graphs, but the nodes that remain connected are more strongly correlated.
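The thresholding rule can be illustrated with a short sketch. The function names and the example distance matrix are hypothetical; the graph is stored as an adjacency dictionary, and the connected components correspond to the "one or more undirected graphs" mentioned in the text.

```python
def build_threshold_graph(L, threshold):
    """Connect dimensions i and j when their distance is below the threshold."""
    m = len(L)
    adjacency = {i: set() for i in range(m)}
    for i in range(m):
        for j in range(i + 1, m):
            if L[i][j] < threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def connected_components(adjacency):
    """Split the graph into its separate undirected graphs (components)."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical distance matrix for four dimensions.
L = [
    [0.0, 1.0, 2.5, 9.0],
    [1.0, 0.0, 1.5, 9.0],
    [2.5, 1.5, 0.0, 9.0],
    [9.0, 9.0, 9.0, 0.0],
]
graph = build_threshold_graph(L, threshold=3.0)
components = connected_components(graph)
```

Lowering `threshold` to 2.0 in this example would drop the edge between the first and third dimensions, illustrating how a smaller threshold yields fewer connections.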
Still further, the maximal clique detection using the Bron-Kerbosch algorithm in step S2.2 comprises the following steps:
Step S2.2.1, taking one of the undirected graphs and defining, for a node $v$ in it, four sets $R$, $P$, $X$ and $N(v)$, where
$R$: the set of nodes already in the clique, initially empty;
$P$: the set of nodes that may still join the clique, initially the full node set;
$X$: the set of nodes excluded from further consideration, initially empty;
$N(v)$: the set of all nodes adjacent to node $v$;
Step S2.2.2, first selecting a node $v$ from the set $P$ and searching for a maximal clique containing $v$: $v$ is placed in $R$, and the nodes not in $N(v)$ are removed from $P$ and $X$; a node is then selected from the remaining set $P$ and the operation of step S2.2.2 is repeated until $P$ becomes empty; if $X$ is also empty, $R$ is a new maximal clique, and if $X$ is not empty, $R$ is a subset of a maximal clique already found.
Step S2.2.3, backtracking to the previously selected node, restoring the sets $R$, $P$ and $X$ to their state before that selection, removing the selected node from $P$ and adding it to $X$, selecting the next node from $P$, and repeating steps S2.2.2 and S2.2.3 until all maximal cliques have been found; the maximal cliques obtained are the subsets into which the dimensions are divided.
Because maximal cliques are selected, the dimensions corresponding to the vertices in each clique are all mutually related within the threshold given by the user. This largely guarantees freedom in arranging the coordinate axes of those dimensions when the parallel coordinate graph is constructed later.
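Steps S2.2.1 to S2.2.3 describe the basic Bron-Kerbosch recursion without pivoting. A compact sketch follows; the function name and the example graph are illustrative, and the sets `R`, `P` and `X` play the roles defined above, with the backtracking of step S2.2.3 handled by the recursion itself.

```python
def bron_kerbosch(adjacency):
    """Return all maximal cliques of an undirected graph as sorted lists."""
    cliques = []

    def expand(R, P, X):
        if not P and not X:
            cliques.append(sorted(R))  # R cannot be extended: maximal clique
            return
        for v in list(P):
            # Add v to the clique; keep only candidates adjacent to v.
            expand(R | {v}, P & adjacency[v], X & adjacency[v])
            # Backtrack: v has been fully explored, move it from P to X.
            P = P - {v}
            X = X | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Hypothetical graph: a triangle 0-1-2 plus a pendant edge 2-3.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = bron_kerbosch(adjacency)
```

Node 2 belongs to both cliques in this example, which mirrors how one dimension can appear in several dimension subsets.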
Further, the dimension ordering and parallel coordinate graph construction in step S3 comprises the following steps:
Step S3.1, ordering the dimensions within each of the dimension subsets obtained in step S2 following a greedy algorithm;
Step S3.2, constructing a parallel coordinate graph according to the dimension order, and coloring the polylines to make the visualization result clearer.
Still further, in step S3.1 the dimensions of each subset obtained in step S2 are ordered greedily as follows: first, according to the distance matrix, the pair of dimensions with the minimum distance is selected from the subset, and the two coordinate axes corresponding to this pair are placed at the first and second positions from the left of the parallel coordinate graph; second, taking the second axis from the left as the reference, the dimension whose dimension point is closest to that of the reference axis is selected from the remaining unplaced dimensions, and its axis is placed at the third position from the left; and so on, until all dimensions in the subset have been placed; the resulting dimension sequence is the order of the coordinate axes in the constructed low-dimensional parallel coordinate graph;
in step S3.2, the parallel coordinate graph is constructed according to the dimension order and the polylines are colored as follows: the coordinate axes of the low-dimensional parallel coordinate graph are arranged in the dimension order and the graph is drawn from the dimension data; the polyline colors are defined according to the category selected by the user; if the data set has no categorical dimension, the data samples are divided into groups using clustering methods and colors are assigned accordingly, the number of groups being specified by the user.
Coloring the polylines helps the user distinguish them in the parallel coordinate graph and improves the graph's ability to convey information, reducing the clutter of overlapping lines that arises when the data volume is too large; it also makes the parallel coordinate graph more attractive and engages the user's interest.
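The greedy ordering of step S3.1 can be sketched as a nearest-neighbor chain over the distance matrix. The function name and the example matrix are hypothetical, and the text does not say which member of the closest pair is placed first, so the sketch simply keeps the order in which the pair is found. The subset must contain at least two dimensions.

```python
def greedy_axis_order(subset, L):
    """Order the dimensions of one subset for the parallel coordinate axes."""
    remaining = list(subset)
    # Start with the pair of dimensions at minimum distance.
    first, second = min(
        ((a, b) for i, a in enumerate(remaining) for b in remaining[i + 1:]),
        key=lambda pair: L[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining.remove(first)
    remaining.remove(second)
    # Repeatedly append the unplaced dimension closest to the last placed axis.
    while remaining:
        nxt = min(remaining, key=lambda d: L[order[-1]][d])
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical symmetric distance matrix for four dimensions.
L = [
    [0, 5, 1, 4],
    [5, 0, 3, 2],
    [1, 3, 0, 6],
    [4, 2, 6, 0],
]
order = greedy_axis_order([0, 1, 2, 3], L)
```

Here dimensions 0 and 2 form the closest pair, dimension 1 is nearest to dimension 2, and dimension 3 is placed last.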
Compared with the prior art, the invention has the following advantages:
The invention provides a "dimension set selection" scheme that groups similar or highly related dimensions into the same set. The method lays out the dimensions according to the correlation among them, connects highly correlated dimensions to generate a dimension graph, extracts the maximal cliques of the dimension graph as sets of highly correlated dimensions, and produces multiple low-dimensional parallel coordinate graphs.
The invention is based on isometric feature mapping (Isomap), which measures long distances as sums of many short distances, and thereby produces a better layout of the dimensions from which low-dimensional parallel coordinate graphs are constructed. The resulting visualization reduces errors caused by distance distortion under projection, uses the display space effectively while retaining as much of the useful information in the original data as possible, and presents a result from which information about related dimensions is easy to extract and read.
Drawings
FIG. 1 is a graph of Isomap layout results with a threshold of 3;
FIG. 2 is an enlarged result plot at ① in FIG. 1 of an undirected graph formed in accordance with the threshold values;
FIG. 3 is an enlarged result plot at ② in FIG. 1 of an undirected graph formed in accordance with the threshold values;
FIG. 4 is a diagram of the visualization of the image segmentation data set at ① in FIG. 1;
FIG. 5 is a diagram of the visualization of the image segmentation data set at ② in FIG. 1;
FIG. 6 is a diagram of the results of an Isomap layout;
FIG. 7 is a graph of MDS layout results;
FIG. 8 is a parallel coordinate diagram constructed from the A undirected graph;
FIG. 9 is a parallel coordinate diagram constructed from the B undirected graph;
FIG. 10 is a parallel coordinate diagram constructed from the C undirected graph.
Detailed Description
The following examples use Python 3.6 (64-bit) for the layout and division of dimensions and RStudio for the final construction of the parallel coordinate graphs; the whole process is executed on Windows 7 (64-bit).
Example 1
The method for constructing the low-dimensional parallel coordinate graph based on the Isomap algorithm layout comprises the following steps:
Step S1, performing dimension correlation calculation and layout using the Isomap algorithm;
Step S1.1, preprocessing the data;
cleaning the data, filling in missing values, establishing the data set, and treating the values of each dimension of the data set as a vector;
if the data set $D$ has $n$ samples, each with $m$ attribute dimensions, the data set $D$ and its $i$-th item $a_i$ are expressed as:
$$D = \{a_1, a_2, \ldots, a_n\}$$
$$a_i = \{v_{i1}, v_{i2}, \ldots, v_{im}\}$$
where $v_{ij}$ denotes the value of the $i$-th item in the $j$-th dimension;
treating each numerical dimension as a vector gives:
$$D = \{d_1, d_2, \ldots, d_m\}$$
$$d_j = \{v_{1j}, v_{2j}, \ldots, v_{nj}\}$$
where $d_j$ denotes the $j$-th dimension and $n$ is the number of samples.
Step S1.2, calculating the distance $\mathrm{dist}(d_i, d_j)$ between each dimension pair $(d_i, d_j)$ of the numerical dimensions and generating a distance matrix $L$;
Step S1.2.1, setting a neighborhood parameter $k$, and setting the distance between each dimension point $d_j$ and each of its $k$ nearest dimension points to the Euclidean distance:
$$\mathrm{dist}(d_i, d_j) = \sqrt{\sum_{t=1}^{n} (v_{ti} - v_{tj})^2}$$
taking dimension point $d_1$ as the starting point, let the set $S = \{d_1\}$ and let the set $U$ contain the vertices other than $d_1$, i.e. $U = \{\text{remaining vertices}\}$; if a dimension point $u$ in $U$ is a neighbor of $d_1$, the distance $\langle d_1, u \rangle$ is recorded as $\mathrm{dist}(d_1, u)$; if $u$ is not a neighbor of $d_1$, the distance $\langle d_1, u \rangle$ is recorded as $\infty$;
Step S1.2.2, selecting from the $k$ neighbors of $d_1$ the dimension point $d_2$ with the minimum distance, removing $d_2$ from $U$ and adding it to $S$; taking $d_2$ as the newly considered intermediate point, updating the distances of the dimension points remaining in $U$: if the distance from the starting point $d_1$ to a dimension point $u$ via $d_2$ is shorter than the distance that does not pass through $d_2$, the distance of $u$ is updated to $\mathrm{dist}(d_1, u) = \langle d_1, d_2 \rangle + \langle d_2, u \rangle$;
Step S1.2.3, repeating step S1.2.2 until all dimension points are contained in $S$, which gives the distances from $d_1$ to all dimension points; finally, calculating in this way the distance $\mathrm{dist}(d_i, d_j)$ between every dimension pair $(d_i, d_j)$ of the numerical dimensions and generating the distance matrix $L = (\mathrm{dist}(d_i, d_j))_{m \times m}$.
Step S1.3, inputting the obtained distance matrix into the MDS algorithm to obtain a two-dimensional scatter plot layout of the dimension points;
the distance matrix is
$$L = (\mathrm{dist}(d_i, d_j))_{m \times m} \in \mathbb{R}^{m \times m}$$
where $\mathbb{R}^{m \times m}$ denotes the set of $m \times m$ real matrices and the element $\mathrm{dist}(d_i, d_j)$ of the distance matrix is the distance between the dimension pair $(d_i, d_j)$; to obtain the two-dimensional position coordinates of each dimension point, $\mathrm{dist}^2(d_i, \cdot)$, $\mathrm{dist}^2(\cdot, d_j)$ and $\mathrm{dist}^2(\cdot, \cdot)$ are computed first:
$$\mathrm{dist}^2(d_i, \cdot) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{dist}^2(d_i, d_j)$$
$$\mathrm{dist}^2(\cdot, d_j) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{dist}^2(d_i, d_j)$$
$$\mathrm{dist}^2(\cdot, \cdot) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{dist}^2(d_i, d_j)$$
where $\cdot$ indicates that the index $i$ or $j$ of $d_i$ or $d_j$ takes all values from 1 to $m$.
Then the inner product matrix $B = (b_{ij})_{m \times m}$ is obtained, whose elements are calculated as:
$$b_{ij} = -\frac{1}{2} \left( \mathrm{dist}^2(d_i, d_j) - \mathrm{dist}^2(d_i, \cdot) - \mathrm{dist}^2(\cdot, d_j) + \mathrm{dist}^2(\cdot, \cdot) \right)$$
Eigenvalue decomposition is performed on the inner product matrix, $B = V \Lambda V^T$, where $\Lambda$ is the diagonal matrix of eigenvalues and $V$ is the matrix of eigenvectors; since two-dimensional position coordinates are required, the diagonal matrix formed by the two largest eigenvalues is taken,
$$\Lambda_* = \mathrm{diag}(\lambda_1, \lambda_2)$$
together with the matrix of the corresponding eigenvectors
$$V_* = (\mathbf{v}_1, \mathbf{v}_2) \in \mathbb{R}^{m \times 2}$$
The matrix
$$Z = \Lambda_*^{1/2} V_*^T \in \mathbb{R}^{2 \times m}$$
satisfies $B \approx Z^T Z$; $Z$ is the representation of the dimension points in two-dimensional space, each column of $Z$ being the two-dimensional coordinates of one dimension point;
the two-dimensional coordinates are plotted as a scatter plot, which gives the final Isomap algorithm layout.
Step S2, dividing the dimensions into subsets;
Step S2.1, constructing an undirected graph;
based on the two-dimensional scatter plot and the distance matrix, the user sets a threshold as required; if the distance $\mathrm{dist}(d_i, d_j)$ between two dimensions is less than the user-defined threshold, the two dimensions are connected by an edge, and otherwise they are left unconnected. One or more undirected graphs are eventually formed.
Step S2.2, detecting maximal cliques using the Bron-Kerbosch algorithm.
Step S2.2.1, taking one of the undirected graphs and defining, for a node $v$ in it, four sets $R$, $P$, $X$ and $N(v)$, where
$R$: the set of nodes already in the clique, initially empty;
$P$: the set of nodes that may still join the clique, initially the full node set;
$X$: the set of nodes excluded from further consideration, initially empty;
$N(v)$: the set of all nodes adjacent to node $v$;
Step S2.2.2, first selecting a node $v$ from the set $P$ and searching for a maximal clique containing $v$: $v$ is placed in $R$, and the nodes not in $N(v)$ are removed from $P$ and $X$; a node is then selected from the remaining set $P$ and the operation of step S2.2.2 is repeated until $P$ becomes empty;
Step S2.2.3, if $X$ is also empty, $R$ is a new maximal clique, and if $X$ is not empty, $R$ is a subset of a maximal clique already found; then backtracking to the previously selected node, restoring the sets $R$, $P$ and $X$ to their state before that selection, removing the selected node from $P$ and adding it to $X$, selecting the next node from $P$, and repeating steps S2.2.2 and S2.2.3 until all maximal cliques have been found; the maximal cliques obtained are the subsets into which the dimensions are divided.
Step S3, arranging the dimensions in order and constructing a parallel coordinate graph.
Step S3.1, ordering the dimensions within each of the dimension subsets obtained in step S2 following a greedy algorithm;
for each dimension subset, first, according to the distance matrix, the pair of dimensions with the minimum distance is selected from the subset, and the two coordinate axes corresponding to this pair are placed at the first and second positions from the left of the parallel coordinate graph; second, taking the second axis from the left as the reference, the dimension whose dimension point is closest to that of the reference axis is selected from the remaining unplaced dimensions, and its axis is placed at the third position from the left; and so on, until all dimensions in the subset have been placed; the resulting dimension sequence is the order of the coordinate axes in the constructed low-dimensional parallel coordinate graph;
Step S3.2, constructing a parallel coordinate graph according to the dimension order, and coloring the polylines to make the visualization result clearer;
the coordinate axes of the low-dimensional parallel coordinate graph are arranged in the dimension order and the graph is drawn from the dimension data; the polyline colors are defined according to the category selected by the user; if the data set has no categorical dimension, the data samples are divided into groups using clustering methods and colors are assigned accordingly, the number of groups being specified by the user.
Example 2
The data of this embodiment come from the image segmentation data set in the UCI Machine Learning Repository. The data were drawn from 7 different outdoor images, each manually divided into 30 blocks, and 20 measurements were selected, giving a data set of 20 feature values for each of 210 image blocks. In this embodiment, each image is treated as a class, each feature as a dimension, and each image block as a sample, forming a data set of 210 samples in 20 dimensions.
The data are normalized per dimension, the 20 dimensions of the data set are treated as 20 vectors and laid out on a two-dimensional plane using the Isomap algorithm, and a threshold is selected as required.
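Per-dimension normalization can be sketched as below. The text does not state which normalization is used, so min-max scaling to [0, 1] is an assumption made only for illustration, as is the function name; a z-score would fit the description equally well.

```python
def min_max_normalize(dims):
    """Scale each dimension vector to the range [0, 1] independently.

    Min-max scaling is an assumed choice; constant dimensions map to 0.0.
    """
    normalized = []
    for vec in dims:
        lo, hi = min(vec), max(vec)
        span = hi - lo
        normalized.append([0.0 if span == 0 else (v - lo) / span for v in vec])
    return normalized


# Two hypothetical dimension vectors, one of them constant.
dims = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
scaled = min_max_normalize(dims)
```

Normalizing each dimension separately keeps dimensions with large numeric ranges from dominating the Euclidean distances computed between dimension vectors.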
To keep the layout results clear, the dimensions are labeled X1 through X20. The data layout and dimension set selection results are shown in FIG. 1. As FIG. 1 shows, two undirected graphs are formed in the layout of the data set, which means that the dimensions of the data set are divided into two subsets under the selected threshold; FIG. 2 and FIG. 3 are enlarged views of the two undirected graphs. From the undirected graphs formed by the dimension points in FIG. 2 and FIG. 3, the maximal cliques are extracted using the Bron-Kerbosch algorithm, and the dimensions corresponding to the points in each maximal clique are ordered greedily according to their pairwise correlation to obtain the dimension order of the parallel coordinate graph.
The low-dimensional parallel coordinate graphs formed from the divided subsets are shown in FIG. 4 and FIG. 5; the coordinate axis labels in the figures have the following meanings:
X6: the result of a line extraction algorithm that counts the lines of length 5 (in any direction) with lower contrast and greater than 5;
X7: the contrast of horizontally adjacent pixels within the region (mean);
X8: the contrast of horizontally adjacent pixels within the region (standard deviation);
X9: the contrast of vertically adjacent pixels within the region (mean);
X10: the contrast of vertically adjacent pixels within the region (standard deviation);
X11: the mean of (R + G + B)/3 over the region;
X12: the mean of the R value over the region;
X13: the mean of the B value over the region;
X14: the mean of the G value over the region;
X16: a measure of excess blue, (2B - (G + R));
X18: a three-dimensional nonlinear transformation of the RGB values using the Foley and Van Dam algorithm.
The 5 dimensions in the parallel coordinate graph of FIG. 4 are essentially measurements of the contrast of neighboring pixels. The contrast of horizontally adjacent pixels can be regarded as a detector for vertical edges and, likewise, the contrast of vertically adjacent pixels as a detector for horizontal edges, so these dimensions are correlated. For example, the two dimensions X7 and X9 in FIG. 4 show a weak negative correlation.
The 6 dimensions in FIG. 5 all relate to image color: each is a different calculation on the R, G and B values of the region, so a strong correlation between these dimensions can be expected. FIG. 5 confirms this expectation. The three groups of polylines represent samples from three different images; because the share of each kind of region differs between images, three distinct classes appear in the parallel coordinate graph, labeled group I, group II and group III. As FIG. 5 shows, all three groups of polylines follow the same trend across the five dimensions on the left, which are strongly positively correlated. For the two dimensions on the right, the correlation differs by class: it is strongly negative for group I, weakly negative for group III, and strongly positive for group II. This difference is related to the class distribution of the original images from which the samples come.
Example 3
The data sources of this embodiment are: medical data informatics is an increasingly important research area in medicine, and visualization can improve verifiability of data by showing combinations of relevant dimensions corresponding to particular clinical outcomes. The present embodiment data is from the UCI medical data set for early stage chronic kidney disease. This data was collected from Apollo Hospital, India for a total of 400 data samples. Of these, 250 samples were patients and 150 were non-patients. The data set includes 24 index features, 13 categorical variables and 11 numerical variables. And completing preprocessing after the missing values in the data set are subjected to data completion. Finally, a 24-dimensional 400-sample data set is obtained.
The Isomap algorithm is based on MDS, and utilizes the shortest path algorithm to change the distance between dimensions from the Euclidean distance to a long distance consisting of short distances, namely the geodesic distance. Therefore, in the case of the same threshold, the dimension points included in the maximal clique formed by the layout obtained by the MDS algorithm may be more than those in the Isomap layout, but the short-distance dimension points sorted in this way do not necessarily represent dimensions with strong correlation.
Fig. 6 and Fig. 7 show the layout results of the data set obtained by the MDS algorithm and the Isomap algorithm. For clarity, the dimensions are labeled X0 to X23, and the analysis here focuses on the undirected graph formed in the upper right corner. As shown in Fig. 6 and Fig. 7, under both layout methods the undirected graph C contains the same five dimension points: X2, X12, X14, X15 and X17. However, because the two methods compute different distances between dimensions, the structure of the undirected graph differs at the same threshold, and so do the maximal cliques found by the Bron-Kerbosch algorithm and the dimension orders obtained by the greedy sorting.
TABLE 4.1 Layout and subset dimension order
[Table 4.1 appears as an image in the original publication.]
As can be seen from Table 4.1, the undirected graph obtained from the MDS layout yields a single maximal clique, forming one dimension subset that contains all five dimensions. The undirected graph obtained from the Isomap layout yields three maximal cliques, forming three dimension subsets of three dimensions each, and the two dimensions X14 and X15 appear in all three subsets.
TABLE 4.2 Correlation coefficient matrix
[Table 4.2 appears as an image in the original publication.]
The correlation among the five dimensions is compared using the correlation coefficient matrix. Table 4.2 shows that X14 and X15 have the highest correlation coefficient of the five dimensions, 0.857, followed by X14 and X17, which is consistent with the subset dimension order obtained. In addition, the correlation coefficient is 0.660 between X2 and X17 and 0.684 between X2 and X15; it is 0.538 between X12 and X2 and 0.581 between X12 and X14. By comparison, the dimensions within the three subsets obtained from the Isomap layout are more strongly correlated than those within the single subset obtained from the MDS layout. That is, at the same threshold, the Isomap algorithm gives a better layout than the MDS algorithm.
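The comparison described above can be sketched in a few lines of Python. This is only an illustration: the data are synthetic, the function names `pearson` and `mean_abs_correlation` are hypothetical, and summarizing a subset by the mean absolute pairwise correlation is an assumption made here, not a measure stated in the text.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def mean_abs_correlation(dims, subset):
    """Average |r| over all pairs of dimensions in one subset (assumed summary)."""
    pairs = list(combinations(subset, 2))
    return sum(abs(pearson(dims[i], dims[j])) for i, j in pairs) / len(pairs)

# Synthetic dimension vectors, for illustration only.
dims = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, 3.0, 2.0, 4.0],
}
tight = mean_abs_correlation(dims, ["a", "b"])       # perfectly correlated pair
loose = mean_abs_correlation(dims, ["a", "b", "c"])  # includes a weaker dimension
```

A subset whose summary value is higher groups dimensions that are, on average, more strongly related, which is the sense in which the two layouts are compared.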
With the threshold set to 3, all parallel coordinate graphs were constructed from the Isomap layout. Fig. 8, Fig. 9 and Fig. 10 correspond to the undirected graphs A, B and C in the layout, respectively. To save space, two parallel coordinate graphs are merged into one when the leftmost or rightmost dimension of one coincides with that of the other. In the parallel coordinate graphs, the diseased and non-diseased samples separate clearly into two classes: the polylines labeled I show the distribution of the non-diseased samples and those labeled II the distribution of the diseased samples. In every graph the non-diseased samples are more concentrated and the diseased samples more dispersed, which reflects the fact that the physiological indexes of healthy people fall within a specified range.
Looking at the parallel coordinate graphs in detail, in group A, X9, X10 and X11 form a set of strongly correlated dimensions. The graph shows that, for the group II polylines, X9 and X10 are weakly negatively correlated, while X10 and X11 are positively correlated. Here X9 is random blood glucose, X10 is blood urea and X11 is serum creatinine. Creatinine is a small molecule that, like blood urea, is filtered by the glomeruli and excreted in the urine, so X10 and X11 are positively correlated. Renal failure often reduces renal filtration and raises blood urea and serum creatinine levels, so the group II polylines lie substantially above those of group I on these dimensions.
In group C, the three dimensions X14, X15 and X17 are positively correlated, and all three relate to red blood cells: X14 is hemoglobin, X15 is hematocrit (the proportion of blood volume occupied by red blood cells), and X17 is the red blood cell count. Because patients with renal failure have a reduced hematopoietic capacity, the group II polylines representing patients lie below the group I polylines, and the positive correlation among the three dimensions is consistent with medical knowledge.
The invention uses two different data sets to demonstrate the feasibility of dividing the dimensions into subsets and constructing several low-dimensional parallel coordinate graphs, each composed of related dimensions. Comparing the Isomap layout with the MDS layout, with the dimension correlation coefficient as the index, shows that the Isomap algorithm gives a better layout than the MDS algorithm at the same threshold. The experimental results were also analyzed and interpreted, showing that they are consistent with the actual situation.
Those skilled in the art will appreciate that the invention may be practiced without these specific details. Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, the present invention is not limited to the scope of these embodiments. Various changes that are apparent to those skilled in the art and that fall within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept, are protected.

Claims (10)

1. A method for constructing low-dimensional parallel coordinate graphs based on an Isomap algorithm layout, characterized by comprising the following steps:
step S1, performing dimension correlation calculation and layout using the Isomap algorithm;
step S2, dividing the dimensions into subsets;
step S3, arranging the dimensions in order and constructing the parallel coordinate graphs.
2. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 1, wherein: performing dimension correlation calculation and layout using the Isomap algorithm in step S1 further comprises the following steps:
step S1.1, preprocessing the data;
step S1.2, calculating the distance $\mathrm{dist}(d_i, d_j)$ between each dimension pair $(d_i, d_j)$ of the numerical dimensions and generating a distance matrix $L$;
step S1.3, inputting the obtained distance matrix into the MDS algorithm to obtain a two-dimensional scatter diagram layout of the dimension points.
3. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 2, wherein: the data preprocessing in step S1.1 comprises cleaning the data, filling in missing values, establishing a data set, and treating each dimension of the data set as a vector;
if the data set $D$ has $n$ samples and each sample has $m$ attribute dimensions, the data set $D$ and its $i$-th item $a_i$ are expressed as:
$D = \{a_1, a_2, \ldots, a_n\}$
$a_i = \{v_{i1}, v_{i2}, \ldots, v_{im}\}$
where $v_{ij}$ is the value of the $i$-th item in the $j$-th dimension;
treating each numerical dimension as a vector gives:
$D = \{d_1, d_2, \ldots, d_m\}$
$d_j = \{v_{1j}, v_{2j}, \ldots, v_{nj}\}$
where $d_j$ represents the $j$-th dimension and $n$ is the number of samples.
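As a minimal sketch of the change of viewpoint in this claim, the following Python snippet turns a list of n samples with m values each into m dimension vectors with n values each. The function name `dimensions_as_vectors` and the sample values are hypothetical and chosen only for illustration.

```python
def dimensions_as_vectors(samples):
    """Turn n samples of m values each into m dimension vectors of n values each."""
    if not samples:
        return []
    m = len(samples[0])
    if any(len(row) != m for row in samples):
        raise ValueError("every sample must have the same number of dimensions")
    # zip(*samples) transposes: column j collects v_1j, v_2j, ..., v_nj.
    return [list(column) for column in zip(*samples)]

# Two samples (n = 2) with three dimensions each (m = 3), illustrative values.
samples = [[1.0, 10.0, 100.0],
           [2.0, 20.0, 200.0]]
dimension_vectors = dimensions_as_vectors(samples)
```

Each entry of `dimension_vectors` then plays the role of one dimension point in the later distance and layout steps.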
4. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 2, wherein: calculating the distance $\mathrm{dist}(d_i, d_j)$ between each dimension pair $(d_i, d_j)$ of the numerical dimensions and generating the distance matrix $L$ in step S1.2 further comprises the following steps:
step S1.2.1, setting a neighborhood parameter $k$, and setting the distance between a dimension point $d_j$ and each of its $k$ nearest dimension points to the Euclidean distance:
$\mathrm{dist}(d_i, d_j) = \sqrt{\sum_{t=1}^{n} (v_{ti} - v_{tj})^2}$
taking dimension point $d_1$ as the starting point, let the set $S = \{d_1\}$ and let the set $U$ contain all vertices other than $d_1$; if a dimension point $u$ in $U$ is a neighboring dimension point of $d_1$, the distance of $\langle d_1, u \rangle$ is recorded as $\mathrm{dist}(d_1, u)$; if $u$ is not a neighboring dimension point of $d_1$, the distance of $\langle d_1, u \rangle$ is recorded as $\infty$;
step S1.2.2, selecting from the $k$ neighboring dimension points of $d_1$ the dimension point $d_2$ with the smallest distance, removing $d_2$ from the set $U$ and adding it to the set $S$; taking $d_2$ as the newly considered point, updating the distances of the dimension points in $U$: if the distance from the starting point $d_1$ to a dimension point $u$ via $d_2$ is shorter than the distance that does not pass through $d_2$, the distance of $u$ is updated to $\mathrm{dist}(d_1, u) = \langle d_1, d_2 \rangle + \langle d_2, u \rangle$;
step S1.2.3, repeating step S1.2.2 until all dimension points are contained in $S$, which gives the distances from $d_1$ to all dimension points; finally, calculating the distance $\mathrm{dist}(d_i, d_j)$ between each dimension pair $(d_i, d_j)$ of the numerical dimensions and generating the distance matrix $L = (\mathrm{dist}(d_i, d_j))_{m \times m}$.
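The steps of this claim amount to building a k-nearest-neighbor graph over the dimension points and running a single-source shortest-path search from every point. The sketch below shows one way this could look in Python using Dijkstra's algorithm with a heap. The function names are hypothetical, and making the neighbor graph symmetric is an assumption made here so that the resulting matrix is symmetric; the claim does not state it.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean distance between two dimension vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_graph(dims, k):
    """Link every dimension point to its k nearest points (symmetrized, an assumption)."""
    m = len(dims)
    graph = {i: {} for i in range(m)}
    for i in range(m):
        nearest = sorted((euclidean(dims[i], dims[j]), j) for j in range(m) if j != i)[:k]
        for distance, j in nearest:
            graph[i][j] = distance
            graph[j][i] = distance  # keep the graph undirected
    return graph

def dijkstra(graph, source):
    """Shortest path lengths from source; unreachable points keep distance infinity."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in graph[u].items():
            if d + weight < dist[v]:
                dist[v] = d + weight
                heapq.heappush(heap, (d + weight, v))
    return dist

def geodesic_matrix(dims, k):
    """m x m matrix of shortest-path (geodesic) distances between dimension points."""
    graph = knn_graph(dims, k)
    m = len(dims)
    rows = [dijkstra(graph, i) for i in range(m)]
    return [[rows[i][j] for j in range(m)] for i in range(m)]

# Three illustrative dimension vectors; with k = 1 the only route from 0 to 2 runs via 1.
dims = [[0.0, 2.0], [0.0, 0.0], [3.0, 0.0]]
L = geodesic_matrix(dims, 1)
```

In this toy input the geodesic distance between points 0 and 2 is the sum of the two hops, which is longer than their direct Euclidean distance. If k is too small the neighbor graph may be disconnected, in which case some entries remain infinite.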
5. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 2, wherein: inputting the obtained distance matrix into the MDS algorithm in step S1.3 to obtain the two-dimensional scatter diagram layout of the dimension points proceeds as follows:
the distance matrix is
$L = (\mathrm{dist}(d_i, d_j))_{m \times m} \in \mathbb{R}^{m \times m}$
where $\mathbb{R}^{m \times m}$ denotes the set of $m \times m$ real matrices, and the element $\mathrm{dist}(d_i, d_j)$ of the distance matrix is the distance between the dimension pair $(d_i, d_j)$; the quantities $\mathrm{dist}^2(d_i, \cdot)$, $\mathrm{dist}^2(\cdot, d_j)$ and $\mathrm{dist}^2(\cdot, \cdot)$ are computed as:
$\mathrm{dist}^2(d_i, \cdot) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{dist}^2(d_i, d_j)$
$\mathrm{dist}^2(\cdot, d_j) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{dist}^2(d_i, d_j)$
$\mathrm{dist}^2(\cdot, \cdot) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{dist}^2(d_i, d_j)$
where $\cdot$ indicates that the index $i$ or $j$ of $d_i$ or $d_j$ ranges over all values from 1 to $m$;
the inner product matrix $B = (b_{ij})_{m \times m}$ is then obtained, with elements calculated as:
$b_{ij} = -\frac{1}{2} \left( \mathrm{dist}^2(d_i, d_j) - \mathrm{dist}^2(d_i, \cdot) - \mathrm{dist}^2(\cdot, d_j) + \mathrm{dist}^2(\cdot, \cdot) \right)$
eigenvalue decomposition is performed on the inner product matrix, $B = V \Lambda V^T$, where $\Lambda$ is the diagonal matrix of eigenvalues and $V$ is the matrix of eigenvectors; since two-dimensional position coordinates are required, the diagonal matrix of the two largest eigenvalues
$\Lambda^* = \mathrm{diag}(\lambda_1, \lambda_2)$
and the corresponding eigenvector matrix
$V^* \in \mathbb{R}^{m \times 2}$
are taken from $\Lambda$ and $V$, giving the matrix
$Z = (\Lambda^*)^{1/2} (V^*)^T \in \mathbb{R}^{2 \times m}$
with $B \approx Z^T Z$; the matrix $Z$ is the representation of the dimension points in two-dimensional space, each column of $Z$ (that is, each row of $Z^T$) being the two-dimensional coordinates of one dimension point;
the two-dimensional coordinates are plotted as a scatter diagram to obtain the final Isomap algorithm layout.
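The classical MDS step in this claim (double centering of the squared distances, then keeping the two leading eigenpairs) can be sketched in dependency-free Python as follows. The eigenpairs are found here by power iteration with deflation, which is a substitute chosen only to keep the example self-contained; it is not named in the claim, and a library eigensolver would normally be used instead. All function names are hypothetical, and the sketch assumes the two dominant eigenvalues are positive and distinct.

```python
import math
import random

def double_center(dist):
    """Inner product matrix B obtained by double centering the squared distances."""
    m = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(m)] for i in range(m)]
    row_mean = [sum(sq[i]) / m for i in range(m)]
    col_mean = [sum(sq[i][j] for i in range(m)) / m for j in range(m)]
    total_mean = sum(row_mean) / m
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(m)] for i in range(m)]

def matvec(b, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(bij * vj for bij, vj in zip(row, v)) for row in b]

def top_eigenpair(b, iterations=500, seed=0):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(b))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = matvec(b, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, matvec(b, v)))  # Rayleigh quotient
    return eigenvalue, v

def classical_mds_2d(dist):
    """Two-dimensional coordinates, one row per point (the rows of Z transposed)."""
    b = double_center(dist)
    m = len(b)
    coords = [[0.0, 0.0] for _ in range(m)]
    for axis in range(2):
        lam, v = top_eigenpair(b, seed=axis)
        scale = math.sqrt(max(lam, 0.0))  # guard against a negative eigenvalue
        for i in range(m):
            coords[i][axis] = scale * v[i]
        # Deflation: remove the component just found before the next search.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    return coords

# Illustrative check: the corners of a 3 x 4 rectangle are recovered up to rotation.
corners = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
distances = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in corners] for p in corners]
layout = classical_mds_2d(distances)
```

Because the input distances in this check are exactly Euclidean in the plane, the pairwise distances of `layout` match the input distances; with geodesic distances the match is only approximate.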
6. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 1, wherein: the dimension subset division in step S2 further comprises the following steps:
step S2.1, constructing an undirected graph;
step S2.2, detecting maximal cliques using the Bron-Kerbosch algorithm.
7. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 6, wherein: the undirected graph in step S2.1 is constructed as follows: based on the two-dimensional scatter diagram and the distance matrix, a threshold is set by the user as required; if the distance $\mathrm{dist}(d_i, d_j)$ between two dimensions is smaller than the user-defined threshold, the two dimension points are connected, and otherwise they are left unconnected; one or more undirected graphs are thereby formed on the final scatter diagram.
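A minimal Python sketch of this thresholding step is given below. It links two dimension points when their distance is below the threshold and then lists the connected components, each of which corresponds to one undirected graph on the scatter diagram. The function names and the small distance matrix are hypothetical and for illustration only.

```python
def threshold_graph(dist, threshold):
    """Adjacency sets: i and j are linked when dist[i][j] < threshold."""
    m = len(dist)
    adjacency = {i: set() for i in range(m)}
    for i in range(m):
        for j in range(i + 1, m):
            if dist[i][j] < threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency

def connected_components(adjacency):
    """Each component is one of the undirected graphs formed on the scatter diagram."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components

# Illustrative distance matrix: points 0 and 1 are close, point 2 is far from both.
dist = [[0.0, 1.0, 5.0],
        [1.0, 0.0, 5.0],
        [5.0, 5.0, 0.0]]
graph = threshold_graph(dist, threshold=3.0)
parts = connected_components(graph)
```

A larger threshold links more dimension points and tends to produce fewer, larger graphs; a smaller one leaves more points isolated.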
8. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 6, wherein: detecting maximal cliques using the Bron-Kerbosch algorithm in step S2.2 further comprises the following steps:
step S2.2.1, taking one of the undirected graphs and defining four sets R, P, X and N(v) for a node v of the undirected graph, where R is the set of nodes already in the clique, initially empty; P is the set of nodes that may still join the clique, initially the full node set; X is the set of excluded nodes that have already been processed, initially empty; and N(v) is the set of all nodes adjacent to node v;
step S2.2.2, selecting a node v from the set P and searching for a maximal clique containing v: v is placed in the set R, and the nodes not in N(v) are removed from the sets P and X; a node is then selected from the remaining set P and the operation of step S2.2.2 is repeated until P becomes empty; at that point, if the set X is also empty, the set R is a new maximal clique, and if the set X is not empty, the set R is a subset of a maximal clique already found;
step S2.2.3, backtracking to the previously selected node, restoring the sets R, P and X to their state before that selection, removing the node just selected from the set P and adding it to the set X, selecting the next node from the set P, and repeating the operations of steps S2.2.2 and S2.2.3 until all maximal cliques have been found; the maximal cliques obtained are the subsets into which the dimensions are divided.
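The recursion described in this claim corresponds to the basic Bron-Kerbosch algorithm without pivoting, which can be sketched in Python as follows. The function name and the small example graph are hypothetical; the graph is given as adjacency sets, as produced by a thresholding step.

```python
def bron_kerbosch(adjacency):
    """All maximal cliques of an undirected graph given as adjacency sets."""
    cliques = []

    def expand(r, p, x):
        # r: nodes in the current clique, p: candidates, x: already processed nodes.
        if not p and not x:
            cliques.append(sorted(r))  # nothing can extend r, so r is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}  # backtrack: v is no longer a candidate
            x = x | {v}  # and is recorded as processed

    expand(set(), set(adjacency), set())
    return sorted(cliques)

# Illustrative graph: a triangle 0-1-2 with an extra node 3 attached to node 2.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
subsets = bron_kerbosch(adjacency)
```

Node 2 belongs to both cliques in this example, which mirrors how one dimension can appear in several dimension subsets.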
9. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 1, wherein: arranging the dimensions in order and constructing the parallel coordinate graph in step S3 further comprises the following steps:
step S3.1, sorting the dimensions within each of the dimension subsets divided in step S2 using a greedy algorithm;
step S3.2, constructing a parallel coordinate graph according to the dimension order, and coloring the polylines to make the visualization result clearer.
10. The Isomap algorithm layout-based low-dimensional parallel coordinate graph construction method according to claim 9, wherein: in step S3.1, the dimensions within each dimension subset divided in step S2 are sorted using a greedy algorithm as follows: for a given dimension subset, first, according to the distance matrix, the pair of dimensions with the smallest distance is selected from the subset, and the two coordinate axes corresponding to this pair are placed at the first and second positions from the left of the parallel coordinate graph; second, taking the second axis from the left as the reference, the dimension closest to the dimension point of the reference axis is selected from the remaining unplaced dimensions, and its coordinate axis is placed at the third position from the left; and so on, until all dimensions in the subset have been placed; the resulting dimension order is the order of the coordinate axes in the constructed low-dimensional parallel coordinate graph;
in step S3.2, the parallel coordinate graph is constructed according to the dimension order and the polylines are colored to make the visualization result clearer, as follows: the coordinate axes of the low-dimensional parallel coordinate graph to be constructed are ordered according to the dimension order, and the parallel coordinate graph is drawn from the dimension data; the polyline colors are defined according to the category selected by the user; if the data set has no categorical dimension, the data samples are divided into groups using clustering methods and colors are assigned accordingly, the number of groups being specified by the user.
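The greedy ordering of step S3.1 can be sketched in Python as below. The sketch assumes that the reference axis at each step is the axis placed most recently (the rightmost one so far); this is one reading of the claim wording, not a confirmed detail. The function name and the distance matrix are hypothetical.

```python
def greedy_order(subset, dist):
    """Order the dimensions of one subset: closest pair first, then nearest to the last axis."""
    subset = list(subset)
    if len(subset) < 2:
        return subset
    # Start with the pair of dimensions that has the smallest distance.
    pairs = [(dist[a][b], a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    _, first, second = min(pairs)
    order = [first, second]
    remaining = [d for d in subset if d not in order]
    while remaining:
        # Assumption: the reference is the most recently placed axis.
        nearest = min(remaining, key=lambda d: dist[order[-1]][d])
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Illustrative symmetric distance matrix for four dimensions.
dist = [[0, 2, 9, 4],
        [2, 0, 3, 7],
        [9, 3, 0, 1],
        [4, 7, 1, 0]]
axis_order = greedy_order([0, 1, 2, 3], dist)
```

Here dimensions 2 and 3 are closest and are placed first, dimension 0 is nearest to dimension 3 and comes next, and dimension 1 is placed last. The greedy choice is local, so the resulting order is not guaranteed to minimize the total distance between adjacent axes.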

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010279193.4A CN111488502A (en) 2020-04-10 2020-04-10 Low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout

Publications (1)

Publication Number Publication Date
CN111488502A true CN111488502A (en) 2020-08-04

Family

ID=71812650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010279193.4A Pending CN111488502A (en) 2020-04-10 2020-04-10 Low-dimensional parallel coordinate graph construction method based on Isomap algorithm layout

Country Status (1)

Country Link
CN (1) CN111488502A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200804