CN112417633A - Large-scale network-oriented graph layout method and device - Google Patents


Info

Publication number: CN112417633A (granted as CN112417633B)
Application number: CN202011384170.6A
Authority: CN (China)
Legal status: Active; application granted
Prior art keywords: node, graph, objective function, distribution, layout
Original language: Chinese (zh)
Inventors: 魏迎梅, 韩贝贝, 窦锦身, 康来, 谢毓湘, 蒋杰, 杨雨璇, 万珊珊, 冯素茹
Original and current assignee: National University of Defense Technology

Classifications

    • G06F30/18 Network design, e.g. design based on topological or interconnect aspects of utility systems, piping, heating ventilation air conditioning [HVAC] or cabling
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06N20/00 Machine learning
    • G06F2111/02 CAD in a network environment, e.g. collaborative CAD or distributed simulation


Abstract

The invention discloses a graph layout method for large-scale networks, which comprises the following steps: representing each node in the graph data as a low-dimensional dense vector through a machine-learning-based network embedded representation model, and constructing an embedding matrix of the graph data; and projecting the embedding matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in two-dimensional space. The invention also discloses a graph layout apparatus for large-scale networks. The invention achieves high computational efficiency with little storage space, preserves the local and global structural features of the graph data, and, while retaining local structure information, disperses the nodes with high degree values from their neighbor nodes, thereby effectively alleviating possible crowding or overlap.

Description

Large-scale network-oriented graph layout method and device
Technical Field
The invention belongs to the field of network data processing, and particularly relates to a large-scale network-oriented graph layout method and device.
Background
Faced with ever larger-scale data, graph visualization has become an important means of network data analysis and plays an important role in many application fields, such as biomedical networks, chemical molecular networks, traffic networks, financial transaction networks, academic collaboration networks, and social networks. Graph visualization consists of the graph layout, the expression of network attributes, and appropriate user interaction. The most central element is the graph layout: attribute expression and user interaction both presuppose a good layout, which is why the graph layout is one of the main research topics in the graph visualization field.
Graph layout methods fall into two main categories: force-directed graph layouts and graph layouts based on data dimension reduction. A force-directed layout models the graph as a physical system in which the nodes are treated as steel rings and the edges between them as springs, and simulates the attractive and repulsive forces of the spring system. The system starts from a random initial state; the nodes iteratively update their positions under the interaction of attraction and repulsion, and iteration stops when the forces on all nodes balance and the system reaches a stable state. A graph layout based on data dimension reduction minimizes an objective function that keeps the node distribution in the two-dimensional layout space similar to the node distribution in the graph space, so that the layout preserves and reflects the structure and attribute information of the original graph data as faithfully as possible.
Force-directed layout algorithms are simple and easy to implement, but they can only achieve local optima: the distance between nearby points in the layout space represents the local structure between the corresponding nodes in the graph space well, yet the relationships between different local regions of the graph data, i.e., the global structure information, are difficult to capture. The main idea of dimension-reduction-based methods is instead to reduce the data from the high-dimensional graph space to the two-dimensional layout space while preserving, as much as possible, the proximity between node pairs of the original graph space, so that the overall deviation of the two-dimensional embedding is minimized. Dimension reduction techniques divide into linear and nonlinear ones. Traditional linear methods such as multidimensional scaling can effectively convey the structure of high-dimensional data in a two-dimensional space, but for nonlinearly structured data the structural relationships between data points cannot be expressed effectively by a linear map, which motivated nonlinear dimension reduction algorithms.
Existing nonlinear dimension reduction techniques include the following:
1) The tsNET algorithm. tsNET improves the objective function of the t-SNE nonlinear dimension reduction method: starting from the KL divergence that measures the difference between the node distribution in graph space and the node distribution in layout space, it adds a compression term and a repulsion term to construct a new optimization objective, and obtains a good graph layout result, alleviating crowding between nodes, by iteratively optimizing this objective.
2) The tsNET* algorithm. tsNET* initializes the coordinate values of each node in the layout space with the PivotMDS method on top of the tsNET method; although this increases the complexity of the algorithm, the layout quality of the graph improves greatly compared with tsNET.
The tsNET and tsNET* methods are more flexible than traditional force-directed layout methods. However, these algorithms first compute the graph-theoretic shortest path distance between every node pair to obtain a shortest path distance matrix of size |V| × |V|. The i-th row of the matrix holds the shortest path distances from the i-th node to all other nodes of the graph data, represented as a vector of length |V|. The similarity between node pairs is then computed from this matrix in the form of conditional probabilities, which requires traversing the pairwise distances in turn; the computational complexity is proportional to the square of the number of graph data nodes, i.e., O(|V|²). Storing the shortest path distance matrix has the same complexity, proportional to the square of the number of nodes, i.e., O(|V|²). For graph data with a very large number of nodes, the space and time complexity of obtaining this matrix is unacceptable.
In summary, graph layout technologies based on nonlinear dimension reduction mainly have the following problems. First, when computing the similarity between nodes in the graph data, the shortest path distance matrix (SPDM) of the graph data must be constructed from graph-theoretic shortest path distances; the matrix has size |V| × |V| and must be stored for subsequent use, and if no path exists between a node pair the distance is set to positive infinity. The time and space complexity of this process is quadratic in the number of nodes, i.e., O(|V|²); for large-scale network topologies this is intolerable. Second, when existing nonlinear-dimension-reduction-based layout techniques define the objective function, they do not consider the differing centrality of nodes: a node with a higher degree value has many neighbor nodes, which may overlap in the two-dimensional layout space.
Disclosure of Invention
To overcome the shortcomings of existing graph layout technologies based on nonlinear dimension reduction, the invention provides a graph layout method and apparatus for large-scale networks.
The technical scheme adopted by the invention is as follows:
In a first aspect, a graph layout method for large-scale networks includes:
representing each node in graph data as a low-dimensional dense vector through a network embedded representation model based on machine learning, and constructing an embedded matrix of the graph data;
and projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
Preferably, the constructing an embedded matrix of the graph data by representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model includes:
for one node, acquiring the neighbors of the node through a random walk sampling strategy to obtain a walk sequence;
dividing the walking sequence through a window to obtain a training sample sequence;
constructing a first objective function;
and inputting the training samples into a Skip-Gram model, optimizing the first objective function by a random gradient descent method, and learning to obtain a low-dimensional dense vector of the node.
Preferably, the constructing the first objective function includes:
setting an initial objective function, wherein the initial objective function is:

$$\max_{f} \sum_{u \in V} \log \Pr\big(N_s(u) \mid f(u)\big)$$

wherein $f: V \to \mathbb{R}^k$ is the mapping from nodes to k dimensions, k is a predetermined parameter with $2 < k < |V|$, f is an embedding matrix of size $|V| \times k$, $|V|$ is the size of the node set of the graph data $G = (V, E)$, and $\Pr(N_s(u) \mid f(u))$ is a conditional probability;
when conditional independence is assumed, the conditional probability is:

$$\Pr\big(N_s(u) \mid f(u)\big) = \prod_{n_i \in N_s(u)} \Pr\big(n_i \mid f(u)\big)$$
when the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is:

$$\Pr\big(n_i \mid f(u)\big) = \frac{\exp\big(f(n_i) \cdot f(u)\big)}{\sum_{v \in V} \exp\big(f(v) \cdot f(u)\big)}$$
obtaining a first objective function according to the initial objective function and the conditional probability, wherein the first objective function is:

$$\max_{f} \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_s(u)} f(n_i) \cdot f(u) \Big], \qquad Z_u = \sum_{v \in V} \exp\big(f(u) \cdot f(v)\big)$$
preferably, the obtaining a map layout result of the map data in the two-dimensional space by projecting the embedded matrix through an improved nonlinear dimension reduction algorithm includes:
according to the embedding matrix, adopting conditional probability measurement to measure the similarity between two nodes represented by the low-dimensional dense vector, and obtaining P distribution describing the node similarity;
measuring the proximity between two nodes in the two-dimensional space through Student-t distribution to obtain Q distribution for describing the node proximity;
measure the difference between the P distribution and the Q distribution by KL divergence;
constructing a second objective function according to the difference between the P distribution and the Q distribution and the edge repulsive force strategy of node centrality;
and optimizing the second objective function to obtain the coordinate value of the node in the graph data in the two-dimensional space.
Preferably, the node-centrality-based edge repulsion strategy adds, in front of the repulsion term of the objective function, a constraint measuring a node centrality index; the node centrality index is the degree value of the node.
Preferably, the P distribution is:

$$p_{j|i} = \frac{\exp\big(-\|x_i - x_j\|^2 / 2\delta_i^2\big)}{\sum_{k \neq i} \exp\big(-\|x_i - x_k\|^2 / 2\delta_i^2\big)}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2|V|}$$

wherein $\|x_i - x_j\|$ is the distance between the nodes $x_i$ and $x_j$ represented by the low-dimensional dense vectors, and $\delta_i$ is the variance of the Gaussian distribution centered at node $x_i$;
the Q distribution is:

$$q_{ij} = \frac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}{\sum_{k \neq l} \big(1 + \|y_k - y_l\|^2\big)^{-1}}$$

wherein $\|y_i - y_j\|$ represents the distance between nodes $y_i$ and $y_j$ in the Euclidean layout space of the two-dimensional space;
the KL divergence between the P and Q distributions is:
Figure BDA0002810572770000044
Preferably, the second objective function is:

$$\arg\min_{Y} \; \lambda_{KL} C_{KL} + \frac{\lambda_c}{2|V|} \sum_{i} \|y_i\|^2 - \frac{\lambda_r}{2|V|^2} \sum_{i \neq j} M_{ij} \log\big(\|y_i - y_j\| + \epsilon_r\big)$$

wherein the first term $\lambda_{KL} C_{KL}$ is the KL divergence measuring the difference between the P and Q distributions, the second term is the early compression term, and the third term is the repulsion term containing the weight constraint $M_{ij}$; $\epsilon_r$ is a parameter preventing a singular point from appearing when $\|y_i - y_j\| \approx 0$, and $\lambda_{KL}$, $\lambda_c$ and $\lambda_r$ are parameters weighing the importance of these three terms in the second objective function.
Preferably, after obtaining the graph layout result of the graph data in two-dimensional space by projecting the embedding matrix through the improved nonlinear dimension reduction algorithm, the method further includes:
measuring the layout result of the graph according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
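The edge-crossing count named as an aesthetic index can be measured directly from a layout. The sketch below is one straightforward O(|E|²) way to do so, assuming straight-line edges and counting only proper crossings between edges that share no endpoint; the function names are illustrative, not from the patent.

```python
def ccw(a, b, c):
    """Signed area test: positive if a->b->c turns counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """Proper intersection test for two open segments."""
    d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
    d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def edge_crossings(pos, edges):
    """Count crossings between all edge pairs that share no endpoint."""
    count = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            (a, b), (c, d) = edges[i], edges[j]
            if len({a, b, c, d}) == 4 and segments_cross(pos[a], pos[b], pos[c], pos[d]):
                count += 1
    return count

# Unit square: the two diagonals cross once, opposite sides do not.
pos = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0)}
```

A lower crossing count generally indicates a more readable layout.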
In a second aspect, a large-scale network-oriented graph layout apparatus includes:
the learning module is used for representing each node in graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data;
and the projection module is used for projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
Preferably, the map layout apparatus for a large-scale network further includes:
the aesthetic measurement module is used for measuring the graph layout result according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
By adopting the above technical scheme, the invention has the following beneficial effects:
1) The machine-learning-based network embedded representation model performs representation learning on the nodes to obtain a low-dimensional dense vector for each node, and these vectors are combined into an embedding matrix. This yields higher computational efficiency and a smaller storage footprint, while preserving the local and global structural features of the graph data.
2) The embedding matrix is projected with an improved nonlinear dimension reduction algorithm to obtain the graph layout result of the graph data in two-dimensional space. In particular, the nodes with high degree values in the graph data are dispersed from their neighbor nodes while the local structure information is retained, effectively alleviating possible crowding or overlap.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a large-scale network-oriented graph layout method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the step S10 of the large-scale network-oriented graph layout method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of breadth-first search and depth-first search strategies in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a sampling strategy for random walk according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the step S20 of the large-scale network-oriented graph layout method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a large-scale network-oriented graph layout method according to another embodiment of the present invention;
fig. 7 is a schematic structural diagram of a large-scale network-oriented graph layout apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In an embodiment of the present invention, as shown in fig. 1, an embodiment of the present invention provides a large-scale network-oriented graph layout method, including:
and step S10, representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data.
In the present embodiment, each node $v_i$ in the original graph data $G = (V, E)$ is represented by the machine-learning-based network embedded representation model as a low-dimensional dense vector $x_i \in \mathbb{R}^k$, where $2 < k < |V|$. The low-dimensional dense vector $x_i$ characterizes the structural similarity relationship between node $v_i$ and its low-order and high-order neighbor nodes. The low-dimensional dense vectors of the $|V|$ nodes, $X = \{x_1, x_2, \ldots, x_{|V|}\}$, form an embedding matrix of size $|V| \times k$; this matrix can not only represent the local structural relationships in the graph data but also describe the structural features between local regions, i.e., the global structure information. The i-th row of the embedding matrix represents the structural relationships captured between the current node i and its first-order and higher-order neighbors.
It should be noted that, in step S10, coordinate values of the nodes in two-dimensional space could in principle be obtained directly by letting the network embedded representation model represent each node with a vector of size 2, i.e., k = 2. In that case, however, the structural features of each node and its neighbor nodes are not captured well and the information loss is relatively large, which degrades the layout quality of the graph layout result and the subsequent data analysis and mining. Therefore, step S20 is performed to obtain the final graph layout result from the embedding matrix produced in step S10.
And step S20, projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
First, based on the low-dimensional dense vectors $X = \{x_1, x_2, \ldots, x_{|V|}\}$ of the original nodes $V = \{v_1, v_2, \ldots, v_{|V|}\}$, the similarity between two nodes $x_i$ and $x_j$ is expressed in the form of a conditional probability, giving the probability distribution P. Similarly, the proximity between two nodes $y_i$ and $y_j$ in two-dimensional space is also expressed in the form of a probability distribution Q.
Then, when the graph layout is carried out, an edge repulsion strategy based on node centrality is adopted: a constraint measuring a node centrality index is added in front of the repulsion term of the objective function. The node centrality index may be the degree value of a node, i.e., its number of incident edges. The node-centrality-based edge repulsion strategy assigns different repulsive forces according to the centrality characteristics of different nodes. In particular, a node with a large degree value in the graph data can be dispersed from its neighbor nodes while its local structural features are retained, alleviating possible crowding or overlap with its neighbors.
Finally, an objective function is defined that minimizes the difference between the P distribution and the Q distribution while combining early compression with the node-centrality-based edge repulsion strategy. Continuously iterating and optimizing the objective function finally yields the coordinate values $Y = \{y_1, y_2, \ldots, y_{|V|}\}$ of the nodes in two-dimensional space, i.e., the layout result of the graph data in a two-dimensional space.
In summary, this embodiment first uses the machine-learning network embedded representation model to perform representation learning on the nodes, obtains the low-dimensional dense vector representation of each node, and combines these vectors into an embedding matrix; the computational complexity is low, the required storage space is greatly reduced, and the structural information of the graph data is maintained. The embedding matrix is then projected with the improved nonlinear dimension reduction algorithm to obtain the graph layout result in two-dimensional space; in particular, nodes with higher degree values are dispersed from their neighbor nodes while local structure information is retained, effectively alleviating possible crowding or overlap. Moreover, only these two steps are needed to obtain a high-quality graph layout result, which helps improve graph layout efficiency.
As still another embodiment of the present invention, as shown in fig. 2, the step S10 may include the following steps:
and S101, for one node, acquiring the neighbors of the node through a random walk sampling strategy to obtain a walk sequence.
In this embodiment, for a node u, a random walk along the network structure (the connections between nodes) is performed, and the neighbors of node u are sampled according to a specified sampling strategy, so as to obtain a walk sequence $N_s(u) \subset V$. The walk sequence $N_s(u)$ consists of the neighbor nodes of node u obtained by sampling; depending on the sampling strategy, the obtained neighbor nodes differ.
For given graph data $G = (V, E)$, if node $n_0$ is taken as the starting node, the walk sequence is sampled, and the length of the walk path is denoted by l. The other nodes in the walk sequence can be generated through the transition probability between nodes, with the specific formula:

$$P(n_i = x \mid n_{i-1} = v) = \begin{cases} \pi_{vx} / B, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

In formula (1), $\pi_{vx}$ is the unnormalized transition probability between the two nodes, and B is a normalizing constant. Let the transition probability be $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, wherein

$$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases} \qquad (2)$$

In formula (2), p is the return parameter and q is the in-out parameter; p and q control whether the search process of the random walk follows breadth-first search (BFS) or depth-first search (DFS), as shown in the schematic diagram of the breadth-first and depth-first search strategies in FIG. 3, so that both the local structure and the global structure information of a node can be captured. In the random walk sampling diagram shown in FIG. 4, assume the walk has currently reached node $n_1$ from node $n_0$; the probability of selecting the next node $n_2$ is then decided by the parameters p and q together with $d_{n_0 n_2}$, where $d_{n_0 n_2}$ represents the shortest path distance between nodes $n_0$ and $n_2$ and $w_{n_1 n_2}$ represents the weight between the nodes.
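The biased transition rule of formulas (1) and (2) can be sketched as follows. This is a minimal illustration under simplifying assumptions (unweighted edges unless a weight dictionary is supplied); `alpha` and `transition_probs` are hypothetical names, and the example graph is a toy.

```python
def alpha(p, q, d_tx):
    """Search bias as a function of the shortest-path distance d_tx
    between the previous node t and the candidate next node x:
    1/p if d_tx == 0, 1 if d_tx == 1, 1/q if d_tx == 2."""
    if d_tx == 0:
        return 1.0 / p
    if d_tx == 1:
        return 1.0
    return 1.0 / q

def transition_probs(adj, weights, prev, cur, p, q):
    """Normalized transition distribution out of `cur`, given that the
    walk arrived from `prev` (d is 0 for prev itself, 1 for common
    neighbors of prev, 2 otherwise)."""
    pi = {}
    for x in adj[cur]:
        if x == prev:
            d = 0
        elif x in adj[prev]:
            d = 1
        else:
            d = 2
        pi[x] = alpha(p, q, d) * weights.get((cur, x), 1.0)
    z = sum(pi.values())  # the normalizing constant B
    return {x: v / z for x, v in pi.items()}

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # toy graph
```

With a large p and small q (e.g. p=4, q=0.25), the walk from node 2 (having arrived from node 0) prefers the outward node 3 over returning to node 0, i.e. it behaves DFS-like.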
And step S102, dividing the walk sequence with a window to obtain training sample sequences. That is, a window of size w is defined, and l-w training sample sequences can be generated by sliding the window step by step along the walk sequence of length l.
Step S103, a first objective function is constructed.
First, an initial objective function is obtained, which may be defined as:

$$\max_{f} \sum_{u \in V} \log \Pr\big(N_s(u) \mid f(u)\big) \qquad (3)$$

In formula (3), $f: V \to \mathbb{R}^k$ is the mapping from nodes to k dimensions, where k is a preset parameter and $2 < k < |V|$. The final learning result f is an embedding matrix of size $|V| \times k$; that is, each node in the original graph data is represented by a feature vector of size k, and this vector captures the structural relationships between the node and its first-order and higher-order neighbor nodes.

Secondly, assuming conditional independence, the conditional probability is expressed as:

$$\Pr\big(N_s(u) \mid f(u)\big) = \prod_{n_i \in N_s(u)} \Pr\big(n_i \mid f(u)\big) \qquad (4)$$

Assuming that the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is further expressed as:

$$\Pr\big(n_i \mid f(u)\big) = \frac{\exp\big(f(n_i) \cdot f(u)\big)}{\sum_{v \in V} \exp\big(f(v) \cdot f(u)\big)} \qquad (5)$$

Finally, the first objective function is obtained from the initial objective function and the conditional probability. Combining equation (5) with equation (3), the objective function is finally expressed as:

$$\max_{f} \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_s(u)} f(n_i) \cdot f(u) \Big], \qquad Z_u = \sum_{v \in V} \exp\big(f(u) \cdot f(v)\big) \qquad (6)$$
and S104, inputting the training samples into the Skip-Gram model, optimizing the first objective function through a random gradient descent method, and learning to obtain the low-dimensional dense vector of the node. That is, the node V ═ V in the original image data1,v2,...,v|V|Finally learnTo a low dimensional dense vector denoted X ═ X1,x2,...,x|V|}。
As still another embodiment of the present invention, as shown in fig. 5, the step S20 may include the following steps:
step S201, two nodes x represented by the low-dimensional dense vector are measured by adopting conditional probability according to the embedded matrixiAnd xjAnd obtaining the P distribution describing the similarity of the nodes according to the similarity between the nodes.
Wherein, the P distribution is specifically expressed as:
Figure BDA0002810572770000091
Figure BDA0002810572770000092
the expressions (7) and (8) mean that for two nodes xiAnd xj,xiWith conditional probability pj|iSelection of xjAs its proximity point, if two nodes have high similarity, they are selected with high probability. Conversely, a dissimilar data point will have a very low probability of being its neighbor. If two points are far apart, the conditional probability pj|iIt will be very small. The embedded matrix X ═ X can be obtained by equations (7) and (8)1,x2,...,x|V|P distribution of.
In the formula (7), | | xi-xjIs two nodes xiAnd xjDistance between, δiIs node xiIs the variance of the gaussian distribution of the center point. Since only the similarity between different nodes is considered in the present embodiment, p can be assumedii=0。
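Formulas (7) and (8) can be computed directly for a small embedding matrix, as in the following sketch. The names are illustrative, and a fixed per-point variance is passed in rather than found by the perplexity search used in t-SNE-style implementations.

```python
import math

def p_conditional(X, i, sigma_i):
    """p_{j|i} of formula (7): Gaussian-kernel conditional similarity
    of every point j to point i, with p_{i|i} = 0."""
    n = len(X)
    num = [0.0] * n
    for j in range(n):
        if j != i:
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            num[j] = math.exp(-d2 / (2.0 * sigma_i ** 2))
    z = sum(num)
    return [v / z for v in num]

def p_joint(X, sigmas):
    """Symmetrized p_ij = (p_{j|i} + p_{i|j}) / (2|V|) of formula (8)."""
    n = len(X)
    cond = [p_conditional(X, i, sigmas[i]) for i in range(n)]
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]

X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]  # toy embedding rows
P = p_joint(X, sigmas=[1.0, 1.0, 1.0])
```

The symmetrization guarantees that P is a proper joint distribution (it sums to 1) and that close points get larger mass than distant ones.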
Step S202, measuring the proximity between two nodes y_i and y_j in the two-dimensional space through the Student-t distribution, and obtaining the Q distribution describing node proximity.
Wherein, the Q distribution is specifically expressed as:

$$q_{ij} = \frac{(1+\lVert y_i - y_j\rVert^2)^{-1}}{\sum_{k\neq l}(1+\lVert y_k - y_l\rVert^2)^{-1}} \tag{9}$$
Similar to the P distribution, the Q distribution places similar nodes closer together in the two-dimensional space and dissimilar nodes relatively farther apart. ||y_i - y_j|| denotes the distance between the two nodes y_i and y_j in the Euclidean layout space. As with p_{ii}, q_{ii} is likewise assumed to be 0.
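Equation (9) can be sketched the same way (the function name and toy coordinates are illustrative):

```python
import numpy as np

def q_distribution(Y):
    """Q distribution of equation (9): Student-t kernel with one degree
    of freedom over pairwise layout distances; q_ii is forced to 0."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)   # q_ii = 0
    return num / num.sum()

rng = np.random.default_rng(1)
Y = rng.normal(size=(6, 2))      # toy 2-D layout coordinates
Q = q_distribution(Y)
```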
Step S203, measure the difference between the P distribution and the Q distribution by KL divergence.
Wherein, the difference between the P distribution and the Q distribution is specifically expressed as:

$$C_{KL} = KL(P \parallel Q) = \sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}} \tag{10}$$
Understandably, when the P distribution is approximated by the Q distribution, as much information of the original graph data as possible should be retained; therefore, the smaller the difference between the P distribution and the Q distribution, the better, so that q_{ij} reflects p_{ij} as faithfully as possible.
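Equation (10) is a few lines of numpy (the epsilon guard and the toy distributions are illustrative assumptions, not from the patent):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """C_KL of equation (10); eps guards the log against zero q_ij."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# two toy distributions over node pairs (zero diagonal, each sums to 1)
n = 4
P = np.full((n, n), 1.0 / (n * n - n))
np.fill_diagonal(P, 0.0)
Q = P.copy()
Q[0, 1] += 0.05; Q[1, 0] += 0.05
Q[2, 3] -= 0.05; Q[3, 2] -= 0.05
```

By Gibbs' inequality the divergence is zero exactly when Q matches P, which is the optimization target described above.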
And step S204, constructing a second objective function according to the edge repulsive force strategy of the node centrality.
The edge repulsion strategy based on node centrality exploits the different centrality characteristics of different nodes: a weight constraint measuring node centrality is added in front of the repulsion term of the objective function, so that a node with a lower degree value has a smaller repulsive force against its neighbor nodes, while a node with a higher degree value has a larger repulsive force against its neighbors. While preserving the structural characteristics of the current node and its neighbors, the nodes are thus scattered slightly apart, which effectively alleviates possible overlapping or crowding.
The second objective function is expressed as:

$$Y = \underset{Y}{\arg\min}\ \lambda_{KL}C_{KL} + \frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2 - \frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r) \tag{11}$$
In equation (11), the first term λ_KL·C_KL is the KL divergence measuring the difference between the P and Q distributions, see equation (10). The second term

$$\frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2$$

is an early compression term, which produces better results when the projection remains near the origin in the early stages of optimization; it is non-zero only in the first half of the optimization process. y_i is the coordinate projection of node v_i (corresponding to the low-dimensional dense vector x_i) in the two-dimensional space, and || · || is the Euclidean norm. The third term

$$-\frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r)$$

is a repulsion term with the weight constraint M_{ij}, which comprehensively takes the degree centrality of different nodes into account and alleviates clutter and overlap between neighboring nodes to a certain extent. ε_r is a parameter that prevents singular points when ||y_i - y_j|| ≈ 0: without ε_r, a node may be discarded when ||y_i - y_j|| ≈ 0, making the optimization unstable; if ε_r is too small, the optimization is likewise unstable, while if ε_r is too large, the gradient of the third term becomes small, weakening the effect of the repulsion term. A large number of experiments show that ε_r = 0.2 is a good trade-off. λ_KL, λ_c and λ_r are parameters weighing the importance of these three terms in the second objective function. When λ_c and λ_r are zero, equation (11) reduces to the objective function of the t-SNE algorithm; since that objective is constrained by neither the early compression term nor the edge repulsion strategy, its graph layout result cannot reflect the characteristic information of the original graph data well, and the layout then has to be refined with a force-directed method.
For the third term of the second objective function, the third term of the objective function in tsNET algorithm of the present embodiment
Figure BDA0002810572770000104
Based on different central characteristics of different nodes, a weight constraint M for measuring the centrality of the nodes is added in front of the term functionij(Mij=degreeidegreej,degreeiThe value of the node i, that is, the number of the connecting edges of the node i), the repulsion term after adding the weight constraint for measuring the centrality of the node becomes:
Figure BDA0002810572770000105
In this embodiment, the product of the degrees of two nodes is used as the weight of the edge repulsion, so that a node with a large degree value obtains a larger repulsion term, pushing node i and node j slightly farther apart; this prevents possible overlap, and the strategy is referred to as the edge repulsion strategy based on node centrality.
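Following the description above, the weight M_ij = degree_i · degree_j can be computed from an adjacency matrix in one outer product (the helper name and toy graph are illustrative):

```python
import numpy as np

def degree_weights(A):
    """M_ij = degree_i * degree_j, degrees read off an adjacency matrix A."""
    deg = A.sum(axis=1)
    return np.outer(deg, deg)

# toy 4-node graph: edges (0,1), (0,2), (1,2), (2,3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
M = degree_weights(A)
```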
In other embodiments, according to different needs and purposes, M_{ij} may also use other metrics that measure node centrality, such as closeness centrality, eigenvector centrality, and betweenness centrality.
It should be noted that the third term of the objective function of the tsNET algorithm,

$$-\frac{\lambda_r}{2|V|^2}\sum_{i\neq j}\log(\lVert y_i - y_j\rVert + \epsilon_r),$$

although able to alleviate to a certain extent the clutter and overlap that may occur between nodes in the two-dimensional layout, applies the same weight constraint to all nodes in the graph data and cannot perform targeted layout computation according to the different characteristics of different nodes, especially nodes with high centrality indexes.
Step S205, optimizing the second objective function to obtain coordinate values of the nodes in the graph data in the two-dimensional space.
In this embodiment, the second objective function is iteratively optimized by using a random gradient descent method, and the iteration is stopped until the algorithm converges, so that a graph layout result of the graph data in the two-dimensional space can be obtained.
As still another embodiment of the present invention, as shown in fig. 6, the step S20 may be followed by the following steps:
step S30, measuring the layout result of the graph according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
Among them, the number of edge crossings (crosslessness) is one of the most important aesthetic criteria in graph layout; the smaller the number of edge crossings, the better. The edge crossing metric can be expressed as:

$$N_c = 1 - \frac{c}{c_{max}} \tag{12}$$
In equation (12), c is the number of edge crossings and c_max is an approximate upper bound on the number of edge crossings, specifically expressed as:

$$c_{max} = \frac{|E|(|E|-1)}{2} - \frac{1}{2}\sum_{v\in V}\deg(v)\,(\deg(v)-1) \tag{13}$$
in equation (13), | E | is the edge set size of the graph data G (V, E).
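Equations (12) and (13) can be sketched as follows (the orientation-based crossing test uses strict sign straddling, so boundary contacts and collinear overlaps may be miscounted; all names and the toy layout are illustrative):

```python
import itertools

def _orient(a, b, c):
    # sign of the cross product (b - a) x (c - a)
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def crossings(pos, edges):
    """Count pairwise edge crossings; edges sharing an endpoint are skipped."""
    c = 0
    for (u1, v1), (u2, v2) in itertools.combinations(edges, 2):
        if {u1, v1} & {u2, v2}:
            continue
        a, b, p, q = pos[u1], pos[v1], pos[u2], pos[v2]
        if _orient(a, b, p) != _orient(a, b, q) and \
           _orient(p, q, a) != _orient(p, q, b):
            c += 1
    return c

def c_max(edges):
    """Approximate upper bound of equation (13)."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return m * (m - 1) / 2 - sum(d * (d - 1) for d in deg.values()) / 2

def crosslessness(pos, edges):
    """Normalized metric of equation (12)."""
    cm = c_max(edges)
    return 1.0 - crossings(pos, edges) / cm if cm > 0 else 1.0

pos = {0: (0, 0), 1: (1, 1), 2: (0, 1), 3: (1, 0)}
```

With the four corner points above, the two diagonals cross exactly once while the two vertical sides do not.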
The minimum angle (Minimum angle metric) is defined as the mean absolute deviation of the minimum angle between the edges incident on node v from the ideal minimum angle. The minimum angle metric can be expressed as:

$$m_a = \frac{1}{|V|}\sum_{v\in V}\left|\frac{\theta(v) - \theta_{min}(v)}{\theta(v)}\right| \tag{14}$$
In equation (15), the ideal minimum included angle θ(v) corresponds to all edges on the node dividing 360° equally:

$$\theta(v) = \frac{360°}{\deg(v)} \tag{15}$$
In this embodiment, the layout quality of the graph layout result can be reliably evaluated using the edge crossing number and the minimum angle metric.
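A hedged sketch of the minimum-angle measurement in equations (14) and (15) (it averages only over nodes with at least two incident edges, a simplification of the formula; names and the toy star layout are illustrative):

```python
import math
from collections import defaultdict

def min_angle_metric(pos, edges):
    """Deviation of each node's smallest incident angle from the ideal
    360/deg(v), averaged over nodes with at least two incident edges."""
    incident = defaultdict(list)
    for u, v in edges:
        incident[u].append(v)
        incident[v].append(u)
    deviations = []
    for v, nbrs in incident.items():
        if len(nbrs) < 2:
            continue
        angles = sorted(math.atan2(pos[u][1] - pos[v][1],
                                   pos[u][0] - pos[v][0]) for u in nbrs)
        gaps = [b - a for a, b in zip(angles, angles[1:])]
        gaps.append(2 * math.pi - (angles[-1] - angles[0]))  # wrap-around gap
        ideal = 2 * math.pi / len(nbrs)
        deviations.append(abs(ideal - min(gaps)) / ideal)
    return sum(deviations) / len(deviations) if deviations else 0.0

# a perfect star: four edges at exactly 90 degrees around node 0
pos = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
score = min_angle_metric(pos, edges)
```

A perfectly even star scores 0 (no deviation), and the score grows as two incident edges are squeezed together.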
In addition, the embodiment of the invention also provides a large-scale network-oriented graph layout device, which can realize the large-scale network-oriented graph layout method in the embodiment. As shown in fig. 7, the map layout apparatus for large-scale network includes a learning module 110 and a projection module 120, and the detailed description of each functional module is as follows:
and the learning module 110 is used for representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data.
And a projection module 120, configured to project the embedded matrix through an improved nonlinear dimension reduction algorithm, so as to obtain a graph layout result of the graph data in a two-dimensional space.
Further, as shown in fig. 7, the learning module 110 includes a sampling sub-module 111, a dividing sub-module 112, a first constructing sub-module 113, and a learning sub-module 114, and the detailed description of each functional sub-module is as follows:
and the sampling sub-module 111 is configured to acquire, for one node, neighbors of the node through a random walk sampling strategy, and obtain a walk sequence.
And a dividing submodule 112, configured to divide the walking sequence through a window to obtain a training sample sequence.
A first construction submodule 113 for constructing a first objective function.
And the learning submodule 114 is used for inputting the training samples into the Skip-Gram model, optimizing the first objective function by a stochastic gradient descent method, and learning the low-dimensional dense vectors of the nodes.
Further, the first constructing sub-module 113 is further configured to set an initial objective function, where the initial objective function is:
Figure BDA0002810572770000121
wherein, f is V → RkFor the mapping from node to k dimension, k is a predetermined parameter, and 2 < k < | V |, f is an embedding matrix with a size of | V | × k, | V | is a node set size of (V, E) of graph data G, P is a node set size of (V, E)r(Ns(u) | f (u)) is a conditional probability;
when conditional independence is present, the conditional probability is:
Figure BDA0002810572770000122
when the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is:
Figure BDA0002810572770000123
obtaining a first objective function according to the initial objective function and the conditional probability, wherein the first objective function is as follows:
Figure BDA0002810572770000124
further, as shown in fig. 7, the projection module 120 includes a P distribution submodule 121, a Q distribution submodule 122, a difference metric submodule 123, a second construction submodule 124, and a projection submodule 125, and the detailed description of each functional submodule is as follows:
And the P distribution submodule 121 is configured to measure, according to the embedding matrix, the similarity between two nodes represented by low-dimensional dense vectors using conditional probability, and to obtain the P distribution describing node similarity.
And the Q distribution submodule 122 is used for measuring the proximity between two nodes in the two-dimensional space through Student-t distribution, and obtaining Q distribution used for describing the proximity of the nodes.
A difference metric sub-module 123 for measuring a difference between the P distribution and the Q distribution by a KL divergence.
And a second constructing submodule 124, configured to construct a second objective function according to the difference between the P distribution and the Q distribution and the edge repulsive force strategy of node centrality.
And the projection submodule 125 is configured to optimize the second objective function, and obtain a coordinate value of the node in the two-dimensional space in the graph data.
Further, the edge repulsive force strategy of the node centrality in the second construction sub-module 124 is to add a measure node centrality index in front of the repulsive force term of the objective function for constraint; the node centrality index is a value of the node.
Further, the P distribution in the P distribution submodule 121 is:
Figure BDA0002810572770000131
Figure BDA0002810572770000132
wherein, | | xi-xjI is a node x represented by the low-dimensional dense vectoriAnd node xjDistance between, δiIs node xiIs the variance of the gaussian distribution of the center point.
Further, the Q distribution in the Q distribution submodule 122 is:
Figure BDA0002810572770000133
wherein, | | yi-yj| | represents a node y in the two-dimensional spaceiAnd node yjDistance in the euclidean layout space.
Further, the KL divergence between the P distribution and the Q distribution in the difference metric submodule 123 is:
Figure BDA0002810572770000134
Further, the second objective function in the second construction submodule 124 is:

$$\underset{Y}{\arg\min}\ \lambda_{KL}C_{KL} + \frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2 - \frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r)$$

wherein the first term λ_KL·C_KL is the KL divergence measuring the difference between the P and Q distributions, the second term (λ_c/2|V|)Σ_i ||y_i||² is the early compression term, the third term −(λ_r/2|V|²)Σ_{i≠j} M_{ij} log(||y_i − y_j|| + ε_r) is the repulsion term containing the weight constraint M_{ij}, ε_r is a parameter preventing singular points when ||y_i − y_j|| ≈ 0, and λ_KL, λ_c and λ_r are parameters weighing the importance of these three terms in the second objective function.
Further, as shown in fig. 7, the large-scale network-oriented graph layout apparatus further includes an aesthetic measurement module 130, and the functional sub-modules are described in detail as follows:
an aesthetic measurement module 130, configured to measure the graph layout result according to a preset aesthetic index, so as to obtain layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
The graph layout device for large-scale networks provided by the embodiment of the invention represents each node in graph data as a low-dimensional dense vector through a machine-learning-based network embedding representation model, constructs an embedding matrix of the graph data, and then projects the embedding matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space. The computational complexity is low, the required storage space is greatly reduced, and all structural information of the graph data is maintained; in particular, a node with a higher degree value in the graph data can be relatively dispersed from its neighbor nodes while local structural information is maintained, effectively alleviating possible crowding or overlapping.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A large-scale network-oriented graph layout method is characterized by comprising the following steps:
representing each node in graph data as a low-dimensional dense vector through a network embedded representation model based on machine learning, and constructing an embedded matrix of the graph data;
and projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
2. The large-scale network-oriented graph layout method according to claim 1, wherein the constructing the embedded matrix of the graph data by representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model comprises:
for one node, acquiring the neighbors of the node through a random walk sampling strategy to obtain a walk sequence;
dividing the walking sequence through a window to obtain a training sample sequence;
constructing a first objective function;
and inputting the training samples into a Skip-Gram model, optimizing the first objective function by a stochastic gradient descent method, and learning the low-dimensional dense vectors of the nodes.
3. The large-scale network-oriented graph layout method according to claim 2, wherein the constructing the first objective function comprises:
setting an initial objective function, wherein the initial objective function is:

$$\max_f \sum_{u\in V}\log \Pr(N_S(u)\mid f(u))$$

wherein f: V → R^k is the mapping from nodes to a k-dimensional feature space, k is a preset parameter with 2 < k < |V|, f is an embedding matrix of size |V| × k, |V| is the node set size of the graph data G(V, E), and Pr(N_S(u) | f(u)) is a conditional probability;

under the conditional independence assumption, the conditional probability is:

$$\Pr(N_S(u)\mid f(u)) = \prod_{n_i\in N_S(u)}\Pr(n_i\mid f(u))$$

when the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is:

$$\Pr(n_i\mid f(u)) = \frac{\exp(f(n_i)\cdot f(u))}{\sum_{v\in V}\exp(f(v)\cdot f(u))}$$

obtaining a first objective function according to the initial objective function and the conditional probability, wherein the first objective function is:

$$\max_f \sum_{u\in V}\Big[-\log Z_u + \sum_{n_i\in N_S(u)} f(n_i)\cdot f(u)\Big], \qquad Z_u=\sum_{v\in V}\exp(f(u)\cdot f(v))$$
4. the method for the graph layout facing the large-scale network according to claim 1, wherein the obtaining the graph layout result of the graph data in the two-dimensional space by projecting the embedded matrix through a modified nonlinear dimension reduction algorithm comprises:
according to the embedding matrix, adopting conditional probability measurement to measure the similarity between two nodes represented by the low-dimensional dense vector, and obtaining P distribution describing the node similarity;
measuring the proximity between two nodes in the two-dimensional space through Student-t distribution to obtain Q distribution for describing the node proximity;
measure the difference between the P distribution and the Q distribution by KL divergence;
constructing a second objective function according to the difference between the P distribution and the Q distribution and the edge repulsive force strategy of node centrality;
and optimizing the second objective function to obtain the coordinate value of the node in the graph data in the two-dimensional space.
5. The large-scale network-oriented graph layout method according to claim 4, wherein the node-centrality edge repulsion strategy adds an index measuring node centrality in front of the repulsion term of the objective function as a constraint; the node centrality index is the degree value of the node.
6. The large-scale network-oriented graph layout method according to claim 4, wherein the P distribution is:

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j\rVert^2 / 2\delta_i^2)}{\sum_{k\neq i}\exp(-\lVert x_i - x_k\rVert^2 / 2\delta_i^2)}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2|V|}$$

wherein ||x_i - x_j|| is the distance between the nodes x_i and x_j represented by low-dimensional dense vectors, and δ_i is the variance of the Gaussian distribution centered at node x_i;

the Q distribution is:

$$q_{ij} = \frac{(1+\lVert y_i - y_j\rVert^2)^{-1}}{\sum_{k\neq l}(1+\lVert y_k - y_l\rVert^2)^{-1}}$$

wherein ||y_i - y_j|| denotes the distance between the nodes y_i and y_j in the Euclidean layout space;

the KL divergence between the P and Q distributions is:

$$C_{KL} = KL(P \parallel Q) = \sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}}$$
7. The large-scale network-oriented graph layout method according to claim 4, wherein the second objective function is:

$$\underset{Y}{\arg\min}\ \lambda_{KL}C_{KL} + \frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2 - \frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r)$$

wherein the first term λ_KL·C_KL is the KL divergence measuring the difference between the P and Q distributions, the second term (λ_c/2|V|)Σ_i ||y_i||² is the early compression term, the third term −(λ_r/2|V|²)Σ_{i≠j} M_{ij} log(||y_i − y_j|| + ε_r) is the repulsion term containing the weight constraint M_{ij}, ε_r is a parameter preventing singular points when ||y_i − y_j|| ≈ 0, and λ_KL, λ_c and λ_r are parameters weighing the importance of these three terms in the second objective function.
8. The large-scale network-oriented graph layout method according to claim 1, further comprising, after the step of projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space:

measuring the graph layout result according to a preset aesthetic index to obtain layout quality; the aesthetic index includes the number of edge crossings and the minimum angle.
9. A large-scale network-oriented graph layout apparatus, comprising:
the learning module is used for representing each node in graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data;
and the projection module is used for projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
10. The large-scale network-oriented graph layout apparatus of claim 9, further comprising:
the aesthetic measurement module is used for measuring the graph layout result according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
CN202011384170.6A 2020-12-01 2020-12-01 Large-scale network-oriented graph layout method and device Active CN112417633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011384170.6A CN112417633B (en) 2020-12-01 2020-12-01 Large-scale network-oriented graph layout method and device


Publications (2)

Publication Number Publication Date
CN112417633A true CN112417633A (en) 2021-02-26
CN112417633B CN112417633B (en) 2022-06-14



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013242622A (en) * 2012-05-17 2013-12-05 Nippon Telegr & Teleph Corp <Ntt> Graph data visualization apparatus, method and program
CN109753589A (en) * 2018-11-28 2019-05-14 中国科学院信息工程研究所 A kind of figure method for visualizing based on figure convolutional network
CN110659394A (en) * 2019-08-02 2020-01-07 中国人民大学 Recommendation method based on two-way proximity
CN110889001A (en) * 2019-11-25 2020-03-17 浙江财经大学 Big image sampling visualization method based on image representation learning


Non-Patent Citations (4)

Title
MICHAEL FERRON; KEN Q. PU; JAROSLAW SZLICHTA: "ARC: A pipeline approach enabling large-scale graph visualization", 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
LIU Liangjun: "Graph Layout Algorithms for Large-Scale Complex Networks", China Master's Theses Full-text Database, Basic Sciences
ZHANG Yunyi: "Research on Degree-Biased Sampling Algorithms for Large-Scale Network Representation Learning", China Master's Theses Full-text Database, Basic Science and Technology
WEI Shichao et al.: "An E-t-SNE-based dimensionality reduction and visualization method for mixed-attribute data", Computer Engineering and Applications

Cited By (9)

Publication number Priority date Publication date Assignee Title
CN113127697A (en) * 2021-03-30 2021-07-16 清华大学 Method and system for optimizing graph layout, electronic device and readable storage medium
CN113127697B (en) * 2021-03-30 2022-11-15 清华大学 Method and system for optimizing graph layout, electronic device and readable storage medium
CN113158391A (en) * 2021-04-30 2021-07-23 中国人民解放军国防科技大学 Method, system, device and storage medium for visualizing multi-dimensional network node classification
CN113158391B (en) * 2021-04-30 2023-05-30 中国人民解放军国防科技大学 Visualization method, system, equipment and storage medium for multidimensional network node classification
WO2022251178A1 (en) * 2021-05-25 2022-12-01 Visa International Service Association Systems, methods, and computer program products for generating node embeddings
CN116171435A (en) * 2021-05-25 2023-05-26 维萨国际服务协会 System, method and computer program product for generating node embeddings
CN113536663A (en) * 2021-06-17 2021-10-22 山东大学 Graph visualization method and system based on ring constraint and stress model
CN113536663B (en) * 2021-06-17 2023-08-25 山东大学 Graph visualization method and system based on ring constraint and stress model
WO2024098195A1 (en) * 2022-11-07 2024-05-16 华为技术有限公司 Embedding representation management method and apparatus

Also Published As

Publication number Publication date
CN112417633B (en) 2022-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant