CN113537281B

CN113537281B - Dimension reduction method for performing visual comparison on multiple high-dimension data

Info

Publication number: CN113537281B
Application number: CN202110576652.XA
Authority: CN
Inventors: 汪云海; 孙国霞; 王银桥; 陈路; 卢金禹; 华博
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2021-05-26
Filing date: 2021-05-26
Publication date: 2024-03-19
Anticipated expiration: 2041-05-26
Also published as: CN113537281A

Abstract

The invention provides a dimension reduction method for visual comparison of a plurality of high-dimensional data sets, comprising: receiving two high-dimensional data sets to be processed; computing edge similarity for the two data sets; performing dimension reduction on the first data set using the t-distributed stochastic neighbor embedding (t-SNE) method; and, based on the edge similarity, introducing an edge vector constraint into the optimization equation of the t-SNE method and obtaining the dimension reduction result of the second data set by solving the optimization. The invention yields consistent dimension reduction results suitable for comparison tasks.

Description

Dimension reduction method for performing visual comparison on multiple high-dimension data

Technical Field

The invention belongs to the technical field of data visualization, and particularly relates to a dimension reduction method for performing visual comparison on a plurality of high-dimension data.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Dimension reduction is the process of mapping high-dimensional data into a perceivable low-dimensional space while preserving, as far as possible, the relationships among data points in the original space. Dimension reduction can reveal the underlying distribution and topology of high-dimensional data, making human analysis and interpretation possible; it is therefore widely applied in fields such as data mining, machine learning, and bioinformatics. Common dimension reduction methods include t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and multidimensional scaling (MDS).

Comparable dimension reduction is an extension of conventional dimension reduction for processing a series of dynamic high-dimensional data sets, for example when comparing the outputs of different layers of a deep neural network. The simplest approach is to reduce the dimension of each data set individually. However, because many dimension reduction methods are randomized and their optimization processes are unpredictable, this approach often introduces undesirable variations, such as shifts in the position of the same data point between frames. A general goal of comparable dimension reduction is therefore to maintain the fidelity of each dimension reduction result while achieving visual consistency across the sequence of results.

Existing methods for comparable dimension reduction can be divided into the following two types according to the type of data change:

Incremental dimension reduction methods, in which each time frame brings an incremental or additive update of the data and previously placed points are typically kept in a static position. An example is incremental principal component analysis (incremental PCA), which finds the optimal overlap of the common data points in two adjacent dimension reduction results and then uses a position estimation algorithm to support adding data points of non-uniform dimension to the data.

Time-varying dimension reduction methods, in which the features of the data points change between time frames while the number of data points stays the same. Dynamic t-distributed stochastic neighbor embedding (Dynamic t-SNE) adds a loss term to t-SNE that penalizes movement of each data point's position between dimension reduction results. Although this approach achieves visual consistency, the strict constraint on the absolute position of each point easily distorts the dimension reduction result. In addition, Dynamic t-SNE jointly optimizes a whole series of data sets received at one time; this imposes a heavy computational burden, places high demands on hardware, and makes the method unsuitable for dimension reduction of streaming data.
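For orientation, the kind of penalty described for Dynamic t-SNE is often written in roughly the following form. This is a sketch drawn from the general literature, not a formula given in this document; the symbols $C_t$ (the ordinary t-SNE cost of frame $t$), $y_i[t]$ (the low-dimensional position of point $i$ in frame $t$), $T$ (number of frames), $N$ (number of points), and the normalization $\lambda/(2N)$ are assumptions.

```latex
C = \sum_{t=1}^{T} C_t
  \;+\; \frac{\lambda}{2N} \sum_{t=1}^{T-1} \sum_{i=1}^{N}
  \left\lVert y_i[t] - y_i[t+1] \right\rVert^2
```

In this form the second term charges for any absolute displacement of a point between consecutive frames, which is why results can become rigid when the data genuinely change.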

Disclosure of Invention

To solve the above problems, the invention provides a dimension reduction method for performing visual comparison on a plurality of high-dimensional data.

According to some embodiments, the present invention employs the following technical solutions:

a dimension reduction method for visually comparing a plurality of high-dimensional data, comprising the steps of:

receiving two high-dimensional data sets to be processed;

computing edge similarity for the two datasets;

performing dimension reduction on the first data set using the t-distributed stochastic neighbor embedding (t-SNE) method;

based on the edge similarity, introducing an edge vector constraint into the optimization equation of the t-SNE method, and obtaining the dimension reduction result of the second data set by solving the optimization.

In an alternative embodiment, the process of computing edge similarity for two data sets includes:

receiving the two input high-dimensional data sets, and constructing a k-nearest neighbor graph for each using KD-trees;

for each node, searching for all primitives containing the node in the k-nearest neighbor graph, and taking the normalized primitive frequency distribution as the feature vector of the node;

and calculating the similarity of the common edges in the k-nearest neighbor graphs of two adjacent time frames based on the feature vector of each node.
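The graph construction step can be sketched as follows. This is a minimal illustration, assuming SciPy's `cKDTree` as the KD-tree and treating the graph as undirected; the function names `knn_graph` and `common_edges` are hypothetical and do not come from the source.

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_graph(X, k):
    """Return an (n, k) array of neighbor indices and the set of undirected edges."""
    tree = cKDTree(X)
    # Query k + 1 neighbors because each point is returned as its own nearest neighbor.
    _, idx = tree.query(X, k=k + 1)
    neighbors = idx[:, 1:]
    edges = set()
    for i in range(len(X)):
        for j in neighbors[i]:
            j = int(j)
            edges.add((min(i, j), max(i, j)))  # store each edge once, smaller index first
    return neighbors, edges


def common_edges(edges0, edges1):
    """Edges present in the k-nearest neighbor graphs of both frames."""
    return edges0 & edges1
```

Only the edges returned by `common_edges` would take part in the later edge similarity computation, since the similarity is defined on edges shared by the two graphs.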

As a further limitation, the specific procedure of taking the normalized primitive frequency distribution as the feature vector of the node is: for every node in the two k-nearest neighbor graphs, the frequency distribution of all primitives containing that node is counted separately.
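A small sketch of such a per-node frequency vector is given below. It rests on two assumptions that the text does not state: that a "primitive" is a small connected subgraph pattern (a graphlet), and that only three-node patterns are counted. The function name `graphlet_features` and the three roles it distinguishes (triangle member, middle of a path, end of a path) are illustrative choices.

```python
from itertools import combinations


def graphlet_features(adj):
    """adj maps each node to the set of its neighbors (undirected graph).

    Returns node -> [triangle, path-center, path-end] frequencies that sum to 1.
    """
    feats = {}
    for v, nbrs in adj.items():
        tri = center = end = 0
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                tri += 1      # v, a, b are mutually adjacent
            else:
                center += 1   # a - v - b with v in the middle
        for a in nbrs:
            # v - a - w where w is not adjacent to v: v is an end of the path
            end += sum(1 for w in adj[a] if w != v and w not in nbrs)
        counts = [tri, center, end]
        total = sum(counts)
        feats[v] = [c / total for c in counts] if total else [0.0, 0.0, 0.0]
    return feats
```

For a triangle 0-1-2 with an extra node 3 attached to node 2, node 3 only ever appears as the end of a path, so its vector puts all weight on the last entry.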

As a further limitation, the specific process of calculating the similarity of the common edges in the k-nearest neighbor graphs of two adjacent time frames based on the feature vector of each node includes: the vertex similarity between corresponding nodes is calculated based on the frequency distribution of all primitives including the nodes, and then the similarity of two edges in two k-nearest neighbor graphs is calculated based on the vertex similarity.

As an alternative embodiment, the specific process of introducing the edge vector constraint into the optimization equation of the t-distributed stochastic neighbor embedding (t-SNE) method includes: establishing, based on the similarity, a unified energy optimization equation over the coordinates of the current frame in the dimension reduction space, the coordinates of the current frame in the high-dimensional space, and the coordinates of the previous frame in the dimension reduction space.

A dimension reduction system for visually comparing a plurality of high-dimensional data, comprising:

a first dimension reduction module configured to receive two high-dimensional data sets to be processed and to perform dimension reduction on the first data set using the t-distributed stochastic neighbor embedding (t-SNE) method;

a similarity calculation module configured to calculate edge similarity for the two data sets based on vertex similarity;

a second dimension reduction module configured to introduce an edge vector constraint into the optimization equation of the t-SNE method and to obtain the dimension reduction result of the second data set by solving the optimization.

As an alternative embodiment, the system further comprises a visualization module configured to obtain a visualization result from the optimized dimension reduction positions by coloring them according to the class labels of the data points, using a color table selected by the user in advance.

A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the steps of one of the above-described dimension reduction methods of visually comparing a plurality of high-dimensional data.

A terminal device comprising a processor and a computer readable storage medium, the processor configured to implement instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of one of the above-described dimension reduction methods for visually comparing a plurality of high-dimensional data.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a comparable dimension reduction method for high-dimensional dynamic data. By introducing a similarity metric based on primitive kernel functions together with vector constraints, it can generate consistent and faithful dimension reduction results for multiple data sets. The method overcomes the defect of prior methods, which impose global constraints and therefore cannot reflect the real local changes of high-dimensional data; its results are easier for users to analyze, and it has broad application prospects in the field of data visualization.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

FIG. 1 is a flow chart of the method for computing structural similarity based on primitive kernel functions according to an embodiment;

FIG. 2 (a) shows the dimension reduction result of the high-dimensional time series data for λ=0.01;

FIG. 2 (b) shows the dimension reduction result of the high-dimensional time series data for λ=0.05;

FIG. 2 (c) shows the dimension reduction result of the high-dimensional time series data for λ=0.1;

FIG. 3 (a) shows the dimension reduction result of the artificial high-dimensional time series data using randomly initialized t-SNE;

FIG. 3 (b) shows the dimension reduction result of the artificial high-dimensional time series data using identically initialized t-SNE;

FIG. 3 (c) shows the dimension reduction result of the artificial high-dimensional time series data using Dynamic t-SNE;

FIG. 3 (d) shows the dimension reduction result of the artificial high-dimensional time series data using the present embodiment;

FIG. 4 (a) shows the dimension reduction result of the real high-dimensional time series data using randomly initialized t-SNE;

FIG. 4 (b) shows the dimension reduction result of the real high-dimensional time series data using identically initialized t-SNE;

FIG. 4 (c) shows the dimension reduction result of the real high-dimensional time series data using Dynamic t-SNE;

FIG. 4 (d) shows the dimension reduction result of the real high-dimensional time series data using the present embodiment;

FIG. 5 is a schematic flow chart of the first embodiment.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Embodiments of the invention and features of the embodiments may be combined with each other without conflict.

Example 1

Two high-dimensional data sets to be processed are received, and edge similarity is computed for the two data sets.

The edge similarity is calculated based on a primitive kernel function. As shown in FIG. 1, the method for calculating the edge similarity based on the primitive kernel function includes:

Step 1: Receive the two input high-dimensional data sets $X_0$ and $X_1$, and construct the k-nearest neighbor graphs $G_0$ and $G_1$, respectively, using KD-trees.

Step 2: For the i-th nodes $v_i^0$ and $v_i^1$ in $G_0$ and $G_1$, separately count the frequency distribution of all primitives containing them, denoted $fv_i^0$ and $fv_i^1$.

Step 3: Calculate the vertex similarity between $v_i^0$ and $v_i^1$ (the equation is not reproduced in this text). Its terms are the proportion of common k-nearest neighbors of $v_i^0$ and $v_i^1$, and the cosine similarity of $fv_i^0$ and $fv_i^1$.

Step 4: Calculate the similarity between the edge $e_{ij}^0$ in $G_0$ and the edge $e_{ij}^1$ in $G_1$ (the equation is not reproduced in this text).

The edge-vector-constrained t-SNE process comprises the following steps:

Step 1: Receive the two input high-dimensional data sets $X_0$ and $X_1$.

Step 2: Reduce the dimension of $X_0$ using t-SNE to obtain the dimension reduction result $Y_0$. The optimization equation of t-SNE is as follows:

$$C = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $p_{ij}$ is the symmetric joint probability of the i-th point $x_i^0$ and the j-th point $x_j^0$ of $X_0$, and $q_{ij}$ is the joint probability of the i-th point $y_i^0$ and the j-th point $y_j^0$ of $Y_0$.

Step 3: Add vector constraints to the optimization equation of t-SNE and reduce the dimension of $X_1$ to obtain the dimension reduction result $Y_1$ (the constrained equation is not reproduced in this text). Its terms are defined as follows: the edge similarity term represents the similarity of the common edges $e_{ij}^0$ and $e_{ij}^1$ in $G_0$ and $G_1$; $N$ is the number of common edges; $\lambda$ is the manually set weight of the vector constraint; $y_i^0$ and $y_i^1$ are the i-th data points of $Y_0$ and $Y_1$; $y_j^0$ and $y_j^1$ are the j-th data points of $Y_0$ and $Y_1$; $p_{ij}$ is the symmetric joint probability of $x_i^1$ and $x_j^1$; and $q_{ij}$ is the joint probability of $y_i^1$ and $y_j^1$.
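One plausible shape for the constrained objective, consistent with the quantities listed for Step 3, is sketched here. It is an assumption and not the equation of the source: the name $S_{ij}$ for the edge similarity, the set $E$ of common edges, the squared Euclidean norm, and the $\lambda/N$ normalization are all illustrative choices.

```latex
C = \sum_{i}\sum_{j \neq i} p_{ij}\,\log\frac{p_{ij}}{q_{ij}}
  \;+\; \frac{\lambda}{N} \sum_{(i,j)\in E} S_{ij}\,
  \bigl\lVert \left(y_i^1 - y_j^1\right) - \left(y_i^0 - y_j^0\right) \bigr\rVert^2
```

In this form the first term is the usual t-SNE cost on the second data set, and the second term penalizes change in the edge vector $y_i - y_j$ rather than in absolute positions, so a group of points may translate as a whole at no cost.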

This embodiment solves the equation using a gradient descent algorithm. Solving the optimization equation balances, as far as possible, the fidelity of each single-frame dimension reduction result against the consistency among the results of different frames.
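A gradient descent solver of this kind can be sketched with NumPy as below. The t-SNE part uses the standard Kullback-Leibler cost and its well-known gradient; the edge-vector penalty, its $\lambda/N$ scaling, and the names `q_matrix`, `cost_and_grad`, and `optimize` are assumptions made for illustration, since the constrained equation is not reproduced in this text. `P` is assumed to be a symmetric joint probability matrix with a zero diagonal that sums to 1, `edges` an integer array of common edges, and `S` their similarities.

```python
import numpy as np


def q_matrix(Y):
    """Student-t joint probabilities of the low-dimensional points, plus raw weights."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)
    return w / w.sum(), w


def cost_and_grad(Y, P, Y0, edges, S, lam):
    """KL cost plus the ASSUMED edge-vector penalty, and the gradient w.r.t. Y."""
    Q, w = q_matrix(Y)
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    # Standard t-SNE gradient: 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
    A = (P - Q) * w
    grad = 4.0 * (A.sum(axis=1)[:, None] * Y - A @ Y)
    i, j = edges[:, 0], edges[:, 1]
    diff = (Y[i] - Y[j]) - (Y0[i] - Y0[j])  # change of each edge vector
    scale = lam / len(edges)
    penalty = scale * np.sum(S * np.sum(diff ** 2, axis=1))
    g = 2.0 * scale * S[:, None] * diff
    np.add.at(grad, i, g)    # accumulate, since a node may appear in many edges
    np.add.at(grad, j, -g)
    return kl + penalty, grad


def optimize(P, Y0, edges, S, lam=0.05, lr=1.0, steps=200):
    """Plain gradient descent; the step size lr is problem-dependent."""
    Y = Y0.copy()  # start from the previous frame's layout
    for _ in range(steps):
        _, grad = cost_and_grad(Y, P, Y0, edges, S, lam)
        Y -= lr * grad
    return Y
```

Starting from the previous frame's layout means the penalty is zero at the first step, so early updates are driven by the t-SNE term and the constraint only takes effect as edge vectors begin to change.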

The parameter λ is the weight of the vector constraint; FIG. 2 (a), FIG. 2 (b), and FIG. 2 (c) illustrate the optimization results for different values of λ.

FIG. 3 (a), FIG. 3 (b), FIG. 3 (c), and FIG. 3 (d) show an application of the dimension reduction method to artificial data with four time frames in total. At t=0, 500 data points are generated from five isotropic Gaussian distributions in 100-dimensional space; the center of each distribution is chosen at random among the standard basis vectors, and the variance of each distribution is 0.05. At t=1, all points in the first cluster are shifted by +0.15 in each dimension. At t=2, the second cluster is split in half, with one half moving +0.15 in all dimensions and the other half moving -0.15 in all dimensions. At t=3, the centers of the third and fourth clusters are made to overlap. The results of randomly initialized t-SNE, identically initialized t-SNE, Dynamic t-SNE, and the present embodiment are shown in FIG. 3 (a), FIG. 3 (b), FIG. 3 (c), and FIG. 3 (d), respectively. For ease of comparison, the dimension reduction result of Dynamic t-SNE at t=0 is used as the first frame for all four methods.
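The artificial data described above can be generated along the following lines. Several details are assumptions, because the text leaves them open: clusters are of equal size, centers are drawn without replacement, the changes accumulate from frame to frame, and the overlap at t=3 is produced by translating the fourth cluster onto the third cluster's center. The function name `make_frames` is hypothetical.

```python
import numpy as np


def make_frames(seed=0, n_points=500, n_clusters=5, dim=100, var=0.05, shift=0.15):
    """Return a list of four (n_points, dim) arrays and the cluster labels."""
    rng = np.random.default_rng(seed)
    per = n_points // n_clusters
    labels = np.repeat(np.arange(n_clusters), per)
    # Centers drawn (ASSUMED without replacement) from the standard basis vectors.
    centers = np.eye(dim)[rng.choice(dim, size=n_clusters, replace=False)]
    noise = rng.normal(scale=np.sqrt(var), size=(n_points, dim))
    x0 = centers[labels] + noise

    x1 = x0.copy()
    x1[labels == 0] += shift                 # t=1: move the first cluster

    x2 = x1.copy()
    second = np.flatnonzero(labels == 1)
    half = len(second) // 2
    x2[second[:half]] += shift               # t=2: split the second cluster
    x2[second[half:]] -= shift

    x3 = x2.copy()
    # t=3: translate the fourth cluster so its center coincides with the third's.
    x3[labels == 3] += centers[2] - centers[3]
    return [x0, x1, x2, x3], labels
```

Each returned frame has the same number of points in the same order, so the i-th row refers to the same data point across all four time frames.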

It can be seen that the results of randomly initialized t-SNE in FIG. 3 (a) are not aligned across time frames. The identically initialized t-SNE in FIG. 3 (b) provides better visual consistency than the randomly initialized t-SNE, but some clusters still shift unnecessarily. In the Dynamic t-SNE result in FIG. 3 (c), the third and fourth clusters at t=3 should coincide completely, but because of the absolute position constraints they are merely adjacent. In contrast, the present embodiment in FIG. 3 (d) generates dimension reduction results that are both consistent and faithful to the data.

FIG. 4 (a), FIG. 4 (b), FIG. 4 (c), and FIG. 4 (d) show an application of the present embodiment to a data set derived from the convolutional neural network VGG-16. The raw data are 700 images from the ImageNet data set, covering ten categories: tiger cat, tabby cat, giant schnauzer, standard schnauzer, great white shark, tiger shark, canary, sparrow, station wagon, and military uniform. The images are input into a pretrained VGG-16 network, and the output feature vectors of the last four layers of the network are taken as the high-dimensional time series data of four time frames.

The same four dimension reduction methods are compared on this data. As before, the dimension reduction result of Dynamic t-SNE at t=0 is used as the first frame for all four methods. The randomly initialized t-SNE in FIG. 4 (a) and the identically initialized t-SNE in FIG. 4 (b) produce the most faithful single-frame results but fail to maintain consistency. The Dynamic t-SNE in FIG. 4 (c) produces results that are too rigid to reflect drastic changes in topology. The present embodiment in FIG. 4 (d) is more robust to drastic changes while producing more faithful dimension reduction results.

Example 2

This embodiment provides a dimension reduction system for visually comparing a plurality of high-dimensional data, comprising:

an input module configured to receive the input high-dimensional time series data and calculate a k-nearest neighbor graph for each frame of data;

a similarity module configured to calculate, using a primitive kernel function, the topological similarity between corresponding nodes of the k-nearest neighbor graphs of adjacent frames, and to calculate the edge similarity between the corresponding common edges;

an energy optimization equation establishment module configured to establish, based on the similarity, a unified energy optimization equation over the coordinates of the current frame in the dimension reduction space, the coordinates of the current frame in the high-dimensional space, and the coordinates of the previous frame in the dimension reduction space; in this embodiment, the dimension reduction space is generally a two-dimensional space;

and a visualization module configured to solve the energy optimization equation to obtain the optimized dimension reduction positions, and to color them according to the class labels of the data points using a color table provided by the user, to obtain the final visualization result.
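The coloring performed by the visualization module can be sketched as a simple lookup. This is an illustration only; the function name `color_points` and the representation of the color table as a dictionary from class label to color string are assumptions.

```python
def color_points(positions, labels, color_table):
    """Pair each 2-D position with the color assigned to its class label."""
    missing = set(labels) - set(color_table)
    if missing:
        # Fail early rather than silently drawing points with no color.
        raise KeyError(f"no color for labels: {sorted(missing)}")
    return [(x, y, color_table[lab]) for (x, y), lab in zip(positions, labels)]
```

The resulting triples can be passed to any scatter-plot routine; keeping the same color table for every frame is what lets a reader follow one class across the sequence of results.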

Example 3

A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the steps of the method provided by the above embodiments.

Example 4

A terminal device comprising a processor and a computer readable storage medium, the processor configured to implement instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the methods provided by the above-described embodiments.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description covers only the preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims

1. A dimension reduction method for visually comparing a plurality of high-dimensional data, characterized by comprising the following steps:

receiving two high-dimensional data sets to be processed;

computing edge similarity for the two data sets; specifically, the two input high-dimensional data sets are received, and a k-nearest neighbor graph is constructed for each using KD-trees; all primitives containing each node are searched for in the k-nearest neighbor graph, and the normalized primitive frequency distribution is taken as the feature vector of the node; the similarity of the common edges in the k-nearest neighbor graphs of two adjacent time frames is calculated based on the feature vector of each node; the specific process of taking the normalized primitive frequency distribution as the feature vector of the node is to count, for every node in the two k-nearest neighbor graphs, the frequency distribution of all primitives containing that node;

based on the edge similarity, introducing an edge vector constraint into the optimization equation of the t-distributed stochastic neighbor embedding (t-SNE) method, and obtaining the dimension reduction result of the second data set by solving the optimization, specifically: solving an energy optimization equation to obtain the optimized dimension reduction positions, and mapping a color table provided by the user to the class labels of the data points to obtain the final visualization result;

the method is applied to a data set derived from the convolutional neural network VGG-16: the raw data are 700 images from the ImageNet data set, covering ten categories (tiger cat, tabby cat, giant schnauzer, standard schnauzer, great white shark, tiger shark, canary, sparrow, station wagon, and military uniform); the images are input into the pretrained VGG-16 network, and the output feature vectors of the last four layers of the network are taken as the high-dimensional time series data of four time frames.

2. A dimension reduction method for visual comparison of a plurality of high-dimensional data as defined in claim 1, wherein: the vertex similarity between corresponding nodes is calculated based on the frequency distribution of all primitives including the nodes, and then the similarity of two edges in two k-nearest neighbor graphs is calculated based on the vertex similarity.

3. A dimension reduction method for visual comparison of a plurality of high-dimensional data as defined in claim 1, wherein: the specific process of introducing the edge vector constraint into the optimization equation of the t-distributed stochastic neighbor embedding (t-SNE) method comprises: establishing, based on the similarity, a unified energy optimization equation over the coordinates of the current frame in the dimension reduction space, the coordinates of the current frame in the high-dimensional space, and the coordinates of the previous frame in the dimension reduction space.

4. A dimension reduction method for visual comparison of a plurality of high-dimensional data as defined in claim 1, wherein: the optimization is solved using a gradient descent algorithm.

5. A dimension reduction system for visually comparing a plurality of high-dimensional data, characterized by comprising:

a similarity calculation module configured to calculate edge similarity for the two data sets based on vertex similarity; specifically, the two input high-dimensional data sets are received, and a k-nearest neighbor graph is constructed for each using KD-trees; all primitives containing each node are searched for in the k-nearest neighbor graph, and the normalized primitive frequency distribution is taken as the feature vector of the node; the similarity of the common edges in the k-nearest neighbor graphs of two adjacent time frames is calculated based on the feature vector of each node; the specific process of taking the normalized primitive frequency distribution as the feature vector of the node is to count, for every node in the two k-nearest neighbor graphs, the frequency distribution of all primitives containing that node;

a second dimension reduction module configured to introduce an edge vector constraint into the optimization equation of the t-distributed stochastic neighbor embedding (t-SNE) method and to obtain the dimension reduction result of the second data set by solving the optimization, specifically: solving an energy optimization equation to obtain the optimized dimension reduction positions, and mapping a color table provided by the user to the class labels of the data points to obtain the final visualization result.

6. A dimension reduction system for visual comparison of a plurality of high-dimensional data as defined in claim 5, wherein: the system further comprises a visualization module configured to obtain a visualization result from the optimized dimension reduction positions by mapping a color table, selected by the user in advance, to the class labels of the data points.

7. A computer-readable storage medium, characterized by: in which a plurality of instructions are stored, said instructions being adapted to be loaded by a processor of a terminal device and to perform the steps of a method of dimension reduction for visual comparison of a plurality of high-dimensional data according to any of claims 1-4.

8. A terminal device, characterized by: comprising a processor and a computer-readable storage medium, the processor configured to implement instructions; a computer readable storage medium for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of a method of dimension reduction for visual comparison of a plurality of high-dimensional data as claimed in any one of claims 1 to 4.