CN112417633A - Large-scale network-oriented graph layout method and device - Google Patents


Info

Publication number: CN112417633A (granted as CN112417633B)
Application number: CN202011384170.6A
Authority: CN (China)
Legal status: Active; application granted
Prior art keywords: node, graph, objective function, distribution, layout
Original language: Chinese (zh)
Inventors: 魏迎梅, 韩贝贝, 窦锦身, 康来, 谢毓湘, 蒋杰, 杨雨璇, 万珊珊, 冯素茹
Original and current assignee: National University of Defense Technology

Classifications

    • G06F30/18 Network design, e.g. design based on topological or interconnect aspects of utility systems, piping, heating ventilation air conditioning [HVAC] or cabling
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06N20/00 Machine learning
    • G06F2111/02 CAD in a network environment, e.g. collaborative CAD or distributed simulation


Abstract

The invention discloses a graph layout method for large-scale networks, which comprises the following steps: representing each node in the graph data as a low-dimensional dense vector through a machine-learning-based network embedded representation model, and constructing an embedding matrix of the graph data; and projecting the embedding matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in two-dimensional space. The invention also discloses a graph layout apparatus for large-scale networks. The invention achieves high computational efficiency with little storage space, preserves the local and global structural features of the graph data, and, while retaining local structure information, disperses the nodes with high degree values from their neighbor nodes, thereby effectively alleviating possible crowding or overlap.

Description

Large-scale network-oriented graph layout method and device
Technical Field
The invention belongs to the field of network data processing, and particularly relates to a large-scale network-oriented graph layout method and device.
Background
Faced with ever larger-scale data, graph visualization has become an important means of network data analysis and plays an important role in many application fields, such as biomedical networks, chemical molecular networks, traffic networks, financial transaction networks, academic collaboration networks, and social networks. Graph visualization consists of the graph layout, the expression of network attributes, and appropriate user interaction. The most central element is the graph layout: attribute expression and user interaction both presuppose a good layout, which is why the graph layout is one of the main research topics in the graph visualization field.
Graph layout methods fall into two main categories: force-directed graph layouts and graph layouts based on data dimension reduction. A force-directed layout models the graph as a physical system in which the nodes are treated as steel rings and the edges between them as springs, and simulates the attractive and repulsive forces of the spring system. The system starts from a random initial state; the nodes iteratively update their positions under the interaction of attraction and repulsion, and iteration stops when the forces on all nodes balance and the system reaches a stable state. A graph layout based on data dimension reduction minimizes an objective function that keeps the node distribution in the two-dimensional layout space similar to the node distribution in the graph space, so that the layout preserves and reflects the structure and attribute information of the original graph data as faithfully as possible.
Force-directed layout algorithms are simple and easy to implement, but they can only achieve local optima: the distance between nearby points in the layout space represents the local structure between the corresponding nodes in the graph space well, yet the relationships between different local regions of the graph data, i.e., the global structure information, are difficult to capture. The main idea of dimension-reduction-based methods is instead to reduce the data from the high-dimensional graph space to the two-dimensional layout space while preserving, as much as possible, the proximity between node pairs of the original graph space, so that the overall deviation of the two-dimensional embedding is minimized. Dimension reduction techniques divide into linear and nonlinear ones. Traditional linear methods such as multidimensional scaling can effectively convey the structure of high-dimensional data in a two-dimensional space, but for nonlinearly structured data the structural relationships between data points cannot be expressed effectively by a linear map, which motivated nonlinear dimension reduction algorithms.
Existing nonlinear dimension reduction techniques include the following:
1) The tsNET algorithm. tsNET improves the objective function of the t-SNE nonlinear dimension reduction method: starting from the KL divergence that measures the difference between the node distribution in graph space and the node distribution in layout space, it adds a compression term and a repulsion term to construct a new optimization objective, and obtains a good graph layout result, alleviating crowding between nodes, by iteratively optimizing this objective.
2) The tsNET* algorithm. tsNET* initializes the coordinate values of each node in the layout space with the PivotMDS method on top of the tsNET method; although this increases the complexity of the algorithm, the layout quality of the graph improves greatly compared with tsNET.
The tsNET and tsNET* methods are more flexible than traditional force-directed layout methods. However, these algorithms first compute the graph-theoretic shortest path distance between every node pair to obtain a shortest path distance matrix of size |V| × |V|. The i-th row of the matrix holds the shortest path distances from the i-th node to all other nodes of the graph data, represented as a vector of length |V|. The similarity between node pairs is then computed from this matrix in the form of conditional probabilities, which requires traversing the pairwise distances in turn; the computational complexity is proportional to the square of the number of graph data nodes, i.e., O(|V|²). Storing the shortest path distance matrix has the same complexity, proportional to the square of the number of nodes, i.e., O(|V|²). For graph data with a very large number of nodes, the space and time complexity of obtaining this matrix is unacceptable.
In summary, graph layout technologies based on nonlinear dimension reduction mainly have the following problems. First, when computing the similarity between nodes in the graph data, the shortest path distance matrix (SPDM) of the graph data must be constructed from graph-theoretic shortest path distances; the matrix has size |V| × |V| and must be stored for subsequent use, and if no path exists between a node pair the distance is set to positive infinity. The time and space complexity of this process is quadratic in the number of nodes, i.e., O(|V|²); for large-scale network topologies this is intolerable. Second, when existing nonlinear-dimension-reduction-based layout techniques define the objective function, they do not consider the differing centrality of nodes: a node with a higher degree value has many neighbor nodes, which may overlap in the two-dimensional layout space.
Disclosure of Invention
To overcome the shortcomings of existing graph layout technologies based on nonlinear dimension reduction, the invention provides a graph layout method and apparatus for large-scale networks.
The technical scheme adopted by the invention is as follows:
In a first aspect, a graph layout method for large-scale networks includes:
representing each node in graph data as a low-dimensional dense vector through a network embedded representation model based on machine learning, and constructing an embedded matrix of the graph data;
and projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
Preferably, the constructing an embedded matrix of the graph data by representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model includes:
for one node, acquiring the neighbors of the node through a random walk sampling strategy to obtain a walk sequence;
dividing the walking sequence through a window to obtain a training sample sequence;
constructing a first objective function;
and inputting the training samples into a Skip-Gram model, optimizing the first objective function by a random gradient descent method, and learning to obtain a low-dimensional dense vector of the node.
Preferably, the constructing the first objective function includes:
setting an initial objective function, wherein the initial objective function is:

$$\max_{f} \sum_{u \in V} \log \Pr\big(N_s(u) \mid f(u)\big)$$

wherein $f: V \to \mathbb{R}^k$ is the mapping from nodes to k dimensions, k is a predetermined parameter with $2 < k < |V|$, f is an embedding matrix of size $|V| \times k$, $|V|$ is the size of the node set of the graph data $G = (V, E)$, and $\Pr(N_s(u) \mid f(u))$ is a conditional probability;
when conditional independence is assumed, the conditional probability is:

$$\Pr\big(N_s(u) \mid f(u)\big) = \prod_{n_i \in N_s(u)} \Pr\big(n_i \mid f(u)\big)$$
when the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is:

$$\Pr\big(n_i \mid f(u)\big) = \frac{\exp\big(f(n_i) \cdot f(u)\big)}{\sum_{v \in V} \exp\big(f(v) \cdot f(u)\big)}$$
obtaining a first objective function according to the initial objective function and the conditional probability, wherein the first objective function is:

$$\max_{f} \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_s(u)} f(n_i) \cdot f(u) \Big], \qquad Z_u = \sum_{v \in V} \exp\big(f(u) \cdot f(v)\big)$$
preferably, the obtaining a map layout result of the map data in the two-dimensional space by projecting the embedded matrix through an improved nonlinear dimension reduction algorithm includes:
according to the embedding matrix, adopting conditional probability measurement to measure the similarity between two nodes represented by the low-dimensional dense vector, and obtaining P distribution describing the node similarity;
measuring the proximity between two nodes in the two-dimensional space through Student-t distribution to obtain Q distribution for describing the node proximity;
measure the difference between the P distribution and the Q distribution by KL divergence;
constructing a second objective function according to the difference between the P distribution and the Q distribution and the edge repulsive force strategy of node centrality;
and optimizing the second objective function to obtain the coordinate value of the node in the graph data in the two-dimensional space.
Preferably, the node-centrality-based edge repulsion strategy adds, in front of the repulsion term of the objective function, a constraint measuring a node centrality index; the node centrality index is the degree value of the node.
Preferably, the P distribution is:

$$p_{j|i} = \frac{\exp\big(-\|x_i - x_j\|^2 / 2\delta_i^2\big)}{\sum_{k \neq i} \exp\big(-\|x_i - x_k\|^2 / 2\delta_i^2\big)}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2|V|}$$

wherein $\|x_i - x_j\|$ is the distance between the nodes $x_i$ and $x_j$ represented by the low-dimensional dense vectors, and $\delta_i$ is the variance of the Gaussian distribution centered at node $x_i$;
the Q distribution is:

$$q_{ij} = \frac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}{\sum_{k \neq l} \big(1 + \|y_k - y_l\|^2\big)^{-1}}$$

wherein $\|y_i - y_j\|$ represents the distance between nodes $y_i$ and $y_j$ in the Euclidean layout space of the two-dimensional space;
the KL divergence between the P and Q distributions is:
Figure BDA0002810572770000044
Preferably, the second objective function is:

$$\arg\min_{Y} \; \lambda_{KL} C_{KL} + \frac{\lambda_c}{2|V|} \sum_{i} \|y_i\|^2 - \frac{\lambda_r}{2|V|^2} \sum_{i \neq j} M_{ij} \log\big(\|y_i - y_j\| + \epsilon_r\big)$$

wherein the first term $\lambda_{KL} C_{KL}$ is the KL divergence measuring the difference between the P and Q distributions, the second term is the early compression term, and the third term is the repulsion term containing the weight constraint $M_{ij}$; $\epsilon_r$ is a parameter preventing a singular point from appearing when $\|y_i - y_j\| \approx 0$, and $\lambda_{KL}$, $\lambda_c$ and $\lambda_r$ are parameters weighing the importance of these three terms in the second objective function.
Preferably, after obtaining the graph layout result of the graph data in two-dimensional space by projecting the embedding matrix through the improved nonlinear dimension reduction algorithm, the method further includes:
measuring the layout result of the graph according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
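The edge-crossing count named as an aesthetic index can be measured directly from a layout. The sketch below is one straightforward O(|E|²) way to do so, assuming straight-line edges and counting only proper crossings between edges that share no endpoint; the function names are illustrative, not from the patent.

```python
def ccw(a, b, c):
    """Signed area test: positive if a->b->c turns counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """Proper intersection test for two open segments."""
    d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
    d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def edge_crossings(pos, edges):
    """Count crossings between all edge pairs that share no endpoint."""
    count = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            (a, b), (c, d) = edges[i], edges[j]
            if len({a, b, c, d}) == 4 and segments_cross(pos[a], pos[b], pos[c], pos[d]):
                count += 1
    return count

# Unit square: the two diagonals cross once, opposite sides do not.
pos = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0)}
```

A lower crossing count generally indicates a more readable layout.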
In a second aspect, a large-scale network-oriented graph layout apparatus includes:
the learning module is used for representing each node in graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data;
and the projection module is used for projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
Preferably, the map layout apparatus for a large-scale network further includes:
the aesthetic measurement module is used for measuring the graph layout result according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
By adopting the above technical scheme, the invention has the following beneficial effects:
1) The machine-learning-based network embedded representation model performs representation learning on the nodes to obtain a low-dimensional dense vector for each node, and these vectors are combined into an embedding matrix. This yields higher computational efficiency and a smaller storage footprint, while preserving the local and global structural features of the graph data.
2) The embedding matrix is projected with an improved nonlinear dimension reduction algorithm to obtain the graph layout result of the graph data in two-dimensional space. In particular, the nodes with high degree values in the graph data are dispersed from their neighbor nodes while the local structure information is retained, effectively alleviating possible crowding or overlap.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a large-scale network-oriented graph layout method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the step S10 of the large-scale network-oriented graph layout method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of breadth-first search and depth-first search strategies in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a sampling strategy for random walk according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the step S20 of the large-scale network-oriented graph layout method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a large-scale network-oriented graph layout method according to another embodiment of the present invention;
fig. 7 is a schematic structural diagram of a large-scale network-oriented graph layout apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In an embodiment of the present invention, as shown in fig. 1, an embodiment of the present invention provides a large-scale network-oriented graph layout method, including:
and step S10, representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data.
In the present embodiment, each node $v_i$ in the original graph data $G = (V, E)$ is represented by the machine-learning-based network embedded representation model as a low-dimensional dense vector $x_i \in \mathbb{R}^k$, where $2 < k < |V|$. The low-dimensional dense vector $x_i$ characterizes the structural similarity relationship between node $v_i$ and its low-order and high-order neighbor nodes. The low-dimensional dense vectors of the $|V|$ nodes, $X = \{x_1, x_2, \ldots, x_{|V|}\}$, form an embedding matrix of size $|V| \times k$; this matrix can not only represent the local structural relationships in the graph data but also describe the structural features between local regions, i.e., the global structure information. The i-th row of the embedding matrix represents the structural relationships captured between the current node i and its first-order and higher-order neighbors.
It should be noted that, in step S10, coordinate values of the nodes in two-dimensional space could in principle be obtained directly by letting the network embedded representation model represent each node with a vector of size 2, i.e., k = 2. In that case, however, the structural features of each node and its neighbor nodes are not captured well and the information loss is relatively large, which degrades the layout quality of the graph layout result and the subsequent data analysis and mining. Therefore, step S20 is performed to obtain the final graph layout result from the embedding matrix produced in step S10.
And step S20, projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
First, based on the low-dimensional dense vectors $X = \{x_1, x_2, \ldots, x_{|V|}\}$ of the original nodes $V = \{v_1, v_2, \ldots, v_{|V|}\}$, the similarity between two nodes $x_i$ and $x_j$ is expressed in the form of a conditional probability, giving the probability distribution P. Similarly, the proximity between two nodes $y_i$ and $y_j$ in two-dimensional space is also expressed in the form of a probability distribution Q.
Then, when the graph layout is carried out, an edge repulsion strategy based on node centrality is adopted: a constraint measuring a node centrality index is added in front of the repulsion term of the objective function. The node centrality index may be the degree value of a node, i.e., its number of incident edges. The node-centrality-based edge repulsion strategy assigns different repulsive forces according to the centrality characteristics of different nodes. In particular, a node with a large degree value in the graph data can be dispersed from its neighbor nodes while its local structural features are retained, alleviating possible crowding or overlap with its neighbors.
Finally, an objective function is defined that minimizes the difference between the P distribution and the Q distribution while combining early compression with the node-centrality-based edge repulsion strategy. Continuously iterating and optimizing the objective function finally yields the coordinate values $Y = \{y_1, y_2, \ldots, y_{|V|}\}$ of the nodes in two-dimensional space, i.e., the layout result of the graph data in a two-dimensional space.
In summary, this embodiment first uses the machine-learning network embedded representation model to perform representation learning on the nodes, obtains the low-dimensional dense vector representation of each node, and combines these vectors into an embedding matrix; the computational complexity is low, the required storage space is greatly reduced, and the structural information of the graph data is maintained. The embedding matrix is then projected with the improved nonlinear dimension reduction algorithm to obtain the graph layout result in two-dimensional space; in particular, nodes with higher degree values are dispersed from their neighbor nodes while local structure information is retained, effectively alleviating possible crowding or overlap. Moreover, only these two steps are needed to obtain a high-quality graph layout result, which helps improve graph layout efficiency.
As still another embodiment of the present invention, as shown in fig. 2, the step S10 may include the following steps:
and S101, for one node, acquiring the neighbors of the node through a random walk sampling strategy to obtain a walk sequence.
In this embodiment, for a node u, a random walk along the network structure (the connections between nodes) is performed, and the neighbors of node u are sampled according to a specified sampling strategy, so as to obtain a walk sequence $N_s(u) \subset V$. The walk sequence $N_s(u)$ consists of the neighbor nodes of node u obtained by sampling; depending on the sampling strategy, the obtained neighbor nodes differ.
For given graph data $G = (V, E)$, if node $n_0$ is taken as the starting node, the walk sequence is sampled, and the length of the walk path is denoted by l. The other nodes in the walk sequence can be generated through the transition probability between nodes, with the specific formula:

$$P(n_i = x \mid n_{i-1} = v) = \begin{cases} \pi_{vx} / B, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

In formula (1), $\pi_{vx}$ is the unnormalized transition probability between the two nodes, and B is a normalizing constant. Let the transition probability be $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, wherein

$$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases} \qquad (2)$$

In formula (2), p is the return parameter and q is the in-out parameter; p and q control whether the search process of the random walk follows breadth-first search (BFS) or depth-first search (DFS), as shown in the schematic diagram of the breadth-first and depth-first search strategies in FIG. 3, so that both the local structure and the global structure information of a node can be captured. In the random walk sampling diagram shown in FIG. 4, assume the walk has currently reached node $n_1$ from node $n_0$; the probability of selecting the next node $n_2$ is then decided by the parameters p and q together with $d_{n_0 n_2}$, where $d_{n_0 n_2}$ represents the shortest path distance between nodes $n_0$ and $n_2$ and $w_{n_1 n_2}$ represents the weight between the nodes.
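The biased transition rule of formulas (1) and (2) can be sketched as follows. This is a minimal illustration under simplifying assumptions (unweighted edges unless a weight dictionary is supplied); `alpha` and `transition_probs` are hypothetical names, and the example graph is a toy.

```python
def alpha(p, q, d_tx):
    """Search bias as a function of the shortest-path distance d_tx
    between the previous node t and the candidate next node x:
    1/p if d_tx == 0, 1 if d_tx == 1, 1/q if d_tx == 2."""
    if d_tx == 0:
        return 1.0 / p
    if d_tx == 1:
        return 1.0
    return 1.0 / q

def transition_probs(adj, weights, prev, cur, p, q):
    """Normalized transition distribution out of `cur`, given that the
    walk arrived from `prev` (d is 0 for prev itself, 1 for common
    neighbors of prev, 2 otherwise)."""
    pi = {}
    for x in adj[cur]:
        if x == prev:
            d = 0
        elif x in adj[prev]:
            d = 1
        else:
            d = 2
        pi[x] = alpha(p, q, d) * weights.get((cur, x), 1.0)
    z = sum(pi.values())  # the normalizing constant B
    return {x: v / z for x, v in pi.items()}

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # toy graph
```

With a large p and small q (e.g. p=4, q=0.25), the walk from node 2 (having arrived from node 0) prefers the outward node 3 over returning to node 0, i.e. it behaves DFS-like.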
And step S102, dividing the walk sequence with a window to obtain training sample sequences. That is, a window of size w is defined, and l-w training sample sequences can be generated by sliding the window step by step along the walk sequence of length l.
Step S103, a first objective function is constructed.
First, an initial objective function is obtained, which may be defined as:

$$\max_{f} \sum_{u \in V} \log \Pr\big(N_s(u) \mid f(u)\big) \qquad (3)$$

In formula (3), $f: V \to \mathbb{R}^k$ is the mapping from nodes to k dimensions, where k is a preset parameter and $2 < k < |V|$. The final learning result f is an embedding matrix of size $|V| \times k$; that is, each node in the original graph data is represented by a feature vector of size k, and this vector captures the structural relationships between the node and its first-order and higher-order neighbor nodes.

Secondly, assuming conditional independence, the conditional probability is expressed as:

$$\Pr\big(N_s(u) \mid f(u)\big) = \prod_{n_i \in N_s(u)} \Pr\big(n_i \mid f(u)\big) \qquad (4)$$

Assuming that the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is further expressed as:

$$\Pr\big(n_i \mid f(u)\big) = \frac{\exp\big(f(n_i) \cdot f(u)\big)}{\sum_{v \in V} \exp\big(f(v) \cdot f(u)\big)} \qquad (5)$$

Finally, the first objective function is obtained from the initial objective function and the conditional probability. Combining equation (5) with equation (3), the objective function is finally expressed as:

$$\max_{f} \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_s(u)} f(n_i) \cdot f(u) \Big], \qquad Z_u = \sum_{v \in V} \exp\big(f(u) \cdot f(v)\big) \qquad (6)$$
and S104, inputting the training samples into the Skip-Gram model, optimizing the first objective function through a random gradient descent method, and learning to obtain the low-dimensional dense vector of the node. That is, the node V ═ V in the original image data1,v2,...,v|V|Finally learnTo a low dimensional dense vector denoted X ═ X1,x2,...,x|V|}。
As still another embodiment of the present invention, as shown in fig. 5, the step S20 may include the following steps:
step S201, two nodes x represented by the low-dimensional dense vector are measured by adopting conditional probability according to the embedded matrixiAnd xjAnd obtaining the P distribution describing the similarity of the nodes according to the similarity between the nodes.
Wherein, the P distribution is specifically expressed as:
Figure BDA0002810572770000091
Figure BDA0002810572770000092
the expressions (7) and (8) mean that for two nodes xiAnd xj,xiWith conditional probability pj|iSelection of xjAs its proximity point, if two nodes have high similarity, they are selected with high probability. Conversely, a dissimilar data point will have a very low probability of being its neighbor. If two points are far apart, the conditional probability pj|iIt will be very small. The embedded matrix X ═ X can be obtained by equations (7) and (8)1,x2,...,x|V|P distribution of.
In the formula (7), | | xi-xjIs two nodes xiAnd xjDistance between, δiIs node xiIs the variance of the gaussian distribution of the center point. Since only the similarity between different nodes is considered in the present embodiment, p can be assumedii=0。
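Formulas (7) and (8) can be computed directly for a small embedding matrix, as in the following sketch. The names are illustrative, and a fixed per-point variance is passed in rather than found by the perplexity search used in t-SNE-style implementations.

```python
import math

def p_conditional(X, i, sigma_i):
    """p_{j|i} of formula (7): Gaussian-kernel conditional similarity
    of every point j to point i, with p_{i|i} = 0."""
    n = len(X)
    num = [0.0] * n
    for j in range(n):
        if j != i:
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            num[j] = math.exp(-d2 / (2.0 * sigma_i ** 2))
    z = sum(num)
    return [v / z for v in num]

def p_joint(X, sigmas):
    """Symmetrized p_ij = (p_{j|i} + p_{i|j}) / (2|V|) of formula (8)."""
    n = len(X)
    cond = [p_conditional(X, i, sigmas[i]) for i in range(n)]
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]

X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]  # toy embedding rows
P = p_joint(X, sigmas=[1.0, 1.0, 1.0])
```

The symmetrization guarantees that P is a proper joint distribution (it sums to 1) and that close points get larger mass than distant ones.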
Step S202, measuring the proximity between two nodes y_i and y_j in the two-dimensional space through the Student-t distribution, and obtaining the Q distribution describing node proximity.
Wherein, the Q distribution is specifically expressed as:

$$q_{ij} = \frac{(1+\lVert y_i - y_j\rVert^2)^{-1}}{\sum_{k\neq l}(1+\lVert y_k - y_l\rVert^2)^{-1}} \tag{9}$$
Similar to the P distribution, the Q distribution places similar nodes closer together in the two-dimensional space and dissimilar nodes relatively farther apart. ||y_i - y_j|| denotes the distance between the two nodes y_i and y_j in the Euclidean layout space. As with p_{ii}, q_{ii} is likewise assumed to be 0.
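Equation (9) can be sketched the same way (the function name and toy coordinates are illustrative):

```python
import numpy as np

def q_distribution(Y):
    """Q distribution of equation (9): Student-t kernel with one degree
    of freedom over pairwise layout distances; q_ii is forced to 0."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)   # q_ii = 0
    return num / num.sum()

rng = np.random.default_rng(1)
Y = rng.normal(size=(6, 2))      # toy 2-D layout coordinates
Q = q_distribution(Y)
```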
Step S203, measure the difference between the P distribution and the Q distribution by KL divergence.
Wherein, the difference between the P distribution and the Q distribution is specifically expressed as:

$$C_{KL} = KL(P \parallel Q) = \sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}} \tag{10}$$
Understandably, when the P distribution is approximated by the Q distribution, as much information of the original graph data as possible should be retained; therefore, the smaller the difference between the P distribution and the Q distribution, the better, so that q_{ij} reflects p_{ij} as faithfully as possible.
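Equation (10) is a few lines of numpy (the epsilon guard and the toy distributions are illustrative assumptions, not from the patent):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """C_KL of equation (10); eps guards the log against zero q_ij."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# two toy distributions over node pairs (zero diagonal, each sums to 1)
n = 4
P = np.full((n, n), 1.0 / (n * n - n))
np.fill_diagonal(P, 0.0)
Q = P.copy()
Q[0, 1] += 0.05; Q[1, 0] += 0.05
Q[2, 3] -= 0.05; Q[3, 2] -= 0.05
```

By Gibbs' inequality the divergence is zero exactly when Q matches P, which is the optimization target described above.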
And step S204, constructing a second objective function according to the edge repulsive force strategy of the node centrality.
The edge repulsion strategy based on node centrality exploits the different centrality characteristics of different nodes: a weight constraint measuring node centrality is added in front of the repulsion term of the objective function, so that a node with a lower degree value has a smaller repulsive force against its neighbor nodes, while a node with a higher degree value has a larger repulsive force against its neighbors. While preserving the structural characteristics of the current node and its neighbors, the nodes are thus scattered slightly apart, which effectively alleviates possible overlapping or crowding.
The second objective function is expressed as:

$$Y = \underset{Y}{\arg\min}\ \lambda_{KL}C_{KL} + \frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2 - \frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r) \tag{11}$$
In equation (11), the first term λ_KL·C_KL is the KL divergence measuring the difference between the P and Q distributions, see equation (10). The second term

$$\frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2$$

is an early compression term, which produces better results when the projection remains near the origin in the early stages of optimization; it is non-zero only in the first half of the optimization process. y_i is the coordinate projection of node v_i (corresponding to the low-dimensional dense vector x_i) in the two-dimensional space, and || · || is the Euclidean norm. The third term

$$-\frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r)$$

is a repulsion term with the weight constraint M_{ij}, which comprehensively takes the degree centrality of different nodes into account and alleviates clutter and overlap between neighboring nodes to a certain extent. ε_r is a parameter that prevents singular points when ||y_i - y_j|| ≈ 0: without ε_r, a node may be discarded when ||y_i - y_j|| ≈ 0, making the optimization unstable; if ε_r is too small, the optimization is likewise unstable, while if ε_r is too large, the gradient of the third term becomes small, weakening the effect of the repulsion term. A large number of experiments show that ε_r = 0.2 is a good trade-off. λ_KL, λ_c and λ_r are parameters weighing the importance of these three terms in the second objective function. When λ_c and λ_r are zero, equation (11) reduces to the objective function of the t-SNE algorithm; since that objective is constrained by neither the early compression term nor the edge repulsion strategy, its graph layout result cannot reflect the characteristic information of the original graph data well, and the layout then has to be refined with a force-directed method.
For the third term of the second objective function, the third term of the objective function in tsNET algorithm of the present embodiment
Figure BDA0002810572770000104
Based on different central characteristics of different nodes, a weight constraint M for measuring the centrality of the nodes is added in front of the term functionij(Mij=degreeidegreej,degreeiThe value of the node i, that is, the number of the connecting edges of the node i), the repulsion term after adding the weight constraint for measuring the centrality of the node becomes:
Figure BDA0002810572770000105
In this embodiment, the product of the degrees of two nodes is used as the weight of the edge repulsion, so that a node with a large degree value obtains a larger repulsion term, pushing node i and node j slightly farther apart; this prevents possible overlap, and the strategy is referred to as the edge repulsion strategy based on node centrality.
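Following the description above, the weight M_ij = degree_i · degree_j can be computed from an adjacency matrix in one outer product (the helper name and toy graph are illustrative):

```python
import numpy as np

def degree_weights(A):
    """M_ij = degree_i * degree_j, degrees read off an adjacency matrix A."""
    deg = A.sum(axis=1)
    return np.outer(deg, deg)

# toy 4-node graph: edges (0,1), (0,2), (1,2), (2,3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
M = degree_weights(A)
```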
In other embodiments, according to different needs and purposes, M_{ij} may also use other metrics that measure node centrality, such as closeness centrality, eigenvector centrality, and betweenness centrality.
It should be noted that the third term of the objective function of the tsNET algorithm,

$$-\frac{\lambda_r}{2|V|^2}\sum_{i\neq j}\log(\lVert y_i - y_j\rVert + \epsilon_r),$$

although able to alleviate to a certain extent the clutter and overlap that may occur between nodes in the two-dimensional layout, applies the same weight constraint to all nodes in the graph data and cannot perform targeted layout computation according to the different characteristics of different nodes, especially nodes with high centrality indexes.
Step S205, optimizing the second objective function to obtain coordinate values of the nodes in the graph data in the two-dimensional space.
In this embodiment, the second objective function is iteratively optimized by using a random gradient descent method, and the iteration is stopped until the algorithm converges, so that a graph layout result of the graph data in the two-dimensional space can be obtained.
As still another embodiment of the present invention, as shown in fig. 6, the step S20 may be followed by the following steps:
step S30, measuring the layout result of the graph according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
Among them, the number of edge crossings (crosslessness) is one of the most important aesthetic criteria in graph layout; the smaller the number of edge crossings, the better. The edge crossing metric can be expressed as:

$$N_c = 1 - \frac{c}{c_{max}} \tag{12}$$
In equation (12), c is the number of edge crossings and c_max is an approximate upper bound on the number of edge crossings, specifically expressed as:

$$c_{max} = \frac{|E|(|E|-1)}{2} - \frac{1}{2}\sum_{v\in V}\deg(v)\,(\deg(v)-1) \tag{13}$$
in equation (13), | E | is the edge set size of the graph data G (V, E).
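Equations (12) and (13) can be sketched as follows (the orientation-based crossing test uses strict sign straddling, so boundary contacts and collinear overlaps may be miscounted; all names and the toy layout are illustrative):

```python
import itertools

def _orient(a, b, c):
    # sign of the cross product (b - a) x (c - a)
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def crossings(pos, edges):
    """Count pairwise edge crossings; edges sharing an endpoint are skipped."""
    c = 0
    for (u1, v1), (u2, v2) in itertools.combinations(edges, 2):
        if {u1, v1} & {u2, v2}:
            continue
        a, b, p, q = pos[u1], pos[v1], pos[u2], pos[v2]
        if _orient(a, b, p) != _orient(a, b, q) and \
           _orient(p, q, a) != _orient(p, q, b):
            c += 1
    return c

def c_max(edges):
    """Approximate upper bound of equation (13)."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return m * (m - 1) / 2 - sum(d * (d - 1) for d in deg.values()) / 2

def crosslessness(pos, edges):
    """Normalized metric of equation (12)."""
    cm = c_max(edges)
    return 1.0 - crossings(pos, edges) / cm if cm > 0 else 1.0

pos = {0: (0, 0), 1: (1, 1), 2: (0, 1), 3: (1, 0)}
```

With the four corner points above, the two diagonals cross exactly once while the two vertical sides do not.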
The minimum angle (Minimum angle metric) is defined as the mean absolute deviation of the minimum angle between the edges incident on node v from the ideal minimum angle. The minimum angle metric can be expressed as:

$$m_a = \frac{1}{|V|}\sum_{v\in V}\left|\frac{\theta(v) - \theta_{min}(v)}{\theta(v)}\right| \tag{14}$$
In equation (15), the ideal minimum included angle θ(v) corresponds to all edges on the node dividing 360° equally:

$$\theta(v) = \frac{360°}{\deg(v)} \tag{15}$$
In this embodiment, the layout quality of the graph layout result can be reliably evaluated using the edge crossing number and the minimum angle metric.
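A hedged sketch of the minimum-angle measurement in equations (14) and (15) (it averages only over nodes with at least two incident edges, a simplification of the formula; names and the toy star layout are illustrative):

```python
import math
from collections import defaultdict

def min_angle_metric(pos, edges):
    """Deviation of each node's smallest incident angle from the ideal
    360/deg(v), averaged over nodes with at least two incident edges."""
    incident = defaultdict(list)
    for u, v in edges:
        incident[u].append(v)
        incident[v].append(u)
    deviations = []
    for v, nbrs in incident.items():
        if len(nbrs) < 2:
            continue
        angles = sorted(math.atan2(pos[u][1] - pos[v][1],
                                   pos[u][0] - pos[v][0]) for u in nbrs)
        gaps = [b - a for a, b in zip(angles, angles[1:])]
        gaps.append(2 * math.pi - (angles[-1] - angles[0]))  # wrap-around gap
        ideal = 2 * math.pi / len(nbrs)
        deviations.append(abs(ideal - min(gaps)) / ideal)
    return sum(deviations) / len(deviations) if deviations else 0.0

# a perfect star: four edges at exactly 90 degrees around node 0
pos = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
score = min_angle_metric(pos, edges)
```

A perfectly even star scores 0 (no deviation), and the score grows as two incident edges are squeezed together.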
In addition, the embodiment of the invention also provides a large-scale network-oriented graph layout device, which can realize the large-scale network-oriented graph layout method in the embodiment. As shown in fig. 7, the map layout apparatus for large-scale network includes a learning module 110 and a projection module 120, and the detailed description of each functional module is as follows:
and the learning module 110 is used for representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data.
And a projection module 120, configured to project the embedded matrix through an improved nonlinear dimension reduction algorithm, so as to obtain a graph layout result of the graph data in a two-dimensional space.
Further, as shown in fig. 7, the learning module 110 includes a sampling sub-module 111, a dividing sub-module 112, a first constructing sub-module 113, and a learning sub-module 114, and the detailed description of each functional sub-module is as follows:
and the sampling sub-module 111 is configured to acquire, for one node, neighbors of the node through a random walk sampling strategy, and obtain a walk sequence.
And a dividing submodule 112, configured to divide the walking sequence through a window to obtain a training sample sequence.
A first construction submodule 113 for constructing a first objective function.
And the learning submodule 114 is used for inputting the training samples into the Skip-Gram model, optimizing the first objective function by a stochastic gradient descent method, and learning the low-dimensional dense vectors of the nodes.
Further, the first constructing sub-module 113 is further configured to set an initial objective function, where the initial objective function is:
Figure BDA0002810572770000121
wherein, f is V → RkFor the mapping from node to k dimension, k is a predetermined parameter, and 2 < k < | V |, f is an embedding matrix with a size of | V | × k, | V | is a node set size of (V, E) of graph data G, P is a node set size of (V, E)r(Ns(u) | f (u)) is a conditional probability;
when conditional independence is present, the conditional probability is:
Figure BDA0002810572770000122
when the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is:
Figure BDA0002810572770000123
obtaining a first objective function according to the initial objective function and the conditional probability, wherein the first objective function is as follows:
Figure BDA0002810572770000124
further, as shown in fig. 7, the projection module 120 includes a P distribution submodule 121, a Q distribution submodule 122, a difference metric submodule 123, a second construction submodule 124, and a projection submodule 125, and the detailed description of each functional submodule is as follows:
And the P distribution submodule 121 is configured to measure, according to the embedding matrix, the similarity between two nodes represented by low-dimensional dense vectors using conditional probability, and to obtain the P distribution describing node similarity.
And the Q distribution submodule 122 is used for measuring the proximity between two nodes in the two-dimensional space through Student-t distribution, and obtaining Q distribution used for describing the proximity of the nodes.
A difference metric sub-module 123 for measuring a difference between the P distribution and the Q distribution by a KL divergence.
And a second constructing submodule 124, configured to construct a second objective function according to the difference between the P distribution and the Q distribution and the edge repulsive force strategy of node centrality.
And the projection submodule 125 is configured to optimize the second objective function, and obtain a coordinate value of the node in the two-dimensional space in the graph data.
Further, the edge repulsive force strategy of the node centrality in the second construction sub-module 124 is to add a measure node centrality index in front of the repulsive force term of the objective function for constraint; the node centrality index is a value of the node.
Further, the P distribution in the P distribution submodule 121 is:
Figure BDA0002810572770000131
Figure BDA0002810572770000132
wherein, | | xi-xjI is a node x represented by the low-dimensional dense vectoriAnd node xjDistance between, δiIs node xiIs the variance of the gaussian distribution of the center point.
Further, the Q distribution in the Q distribution submodule 122 is:
Figure BDA0002810572770000133
wherein, | | yi-yj| | represents a node y in the two-dimensional spaceiAnd node yjDistance in the euclidean layout space.
Further, the KL divergence between the P distribution and the Q distribution in the difference metric submodule 123 is:
Figure BDA0002810572770000134
Further, the second objective function in the second construction submodule 124 is:

$$\underset{Y}{\arg\min}\ \lambda_{KL}C_{KL} + \frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2 - \frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r)$$

wherein the first term λ_KL·C_KL is the KL divergence measuring the difference between the P and Q distributions, the second term (λ_c/2|V|)Σ_i ||y_i||² is the early compression term, the third term −(λ_r/2|V|²)Σ_{i≠j} M_{ij} log(||y_i − y_j|| + ε_r) is the repulsion term containing the weight constraint M_{ij}, ε_r is a parameter preventing singular points when ||y_i − y_j|| ≈ 0, and λ_KL, λ_c and λ_r are parameters weighing the importance of these three terms in the second objective function.
Further, as shown in fig. 7, the large-scale network-oriented graph layout apparatus further includes an aesthetic measurement module 130, and the functional sub-modules are described in detail as follows:
an aesthetic measurement module 130, configured to measure the graph layout result according to a preset aesthetic index, so as to obtain layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
The graph layout device for large-scale networks provided by the embodiment of the invention represents each node in graph data as a low-dimensional dense vector through a machine-learning-based network embedding representation model, constructs an embedding matrix of the graph data, and then projects the embedding matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space. The computational complexity is low, the required storage space is greatly reduced, and all structural information of the graph data is maintained; in particular, a node with a higher degree value in the graph data can be relatively dispersed from its neighbor nodes while local structural information is maintained, effectively alleviating possible crowding or overlapping.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A large-scale network-oriented graph layout method is characterized by comprising the following steps:
representing each node in graph data as a low-dimensional dense vector through a network embedded representation model based on machine learning, and constructing an embedded matrix of the graph data;
and projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
2. The large-scale network-oriented graph layout method according to claim 1, wherein the constructing the embedded matrix of the graph data by representing each node in the graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model comprises:
for one node, acquiring the neighbors of the node through a random walk sampling strategy to obtain a walk sequence;
dividing the walking sequence through a window to obtain a training sample sequence;
constructing a first objective function;
and inputting the training samples into a Skip-Gram model, optimizing the first objective function by a stochastic gradient descent method, and learning the low-dimensional dense vectors of the nodes.
3. The large-scale network-oriented graph layout method according to claim 2, wherein the constructing the first objective function comprises:
setting an initial objective function, wherein the initial objective function is:

$$\max_f \sum_{u\in V}\log \Pr(N_S(u)\mid f(u))$$

wherein f: V → R^k is the mapping from nodes to a k-dimensional feature space, k is a preset parameter with 2 < k < |V|, f is an embedding matrix of size |V| × k, |V| is the node set size of the graph data G(V, E), and Pr(N_S(u) | f(u)) is a conditional probability;

under the conditional independence assumption, the conditional probability is:

$$\Pr(N_S(u)\mid f(u)) = \prod_{n_i\in N_S(u)}\Pr(n_i\mid f(u))$$

when the influence between two nodes is symmetric in the k-dimensional feature space, the conditional probability is:

$$\Pr(n_i\mid f(u)) = \frac{\exp(f(n_i)\cdot f(u))}{\sum_{v\in V}\exp(f(v)\cdot f(u))}$$

obtaining a first objective function according to the initial objective function and the conditional probability, wherein the first objective function is:

$$\max_f \sum_{u\in V}\Big[-\log Z_u + \sum_{n_i\in N_S(u)} f(n_i)\cdot f(u)\Big], \qquad Z_u=\sum_{v\in V}\exp(f(u)\cdot f(v))$$
4. the method for the graph layout facing the large-scale network according to claim 1, wherein the obtaining the graph layout result of the graph data in the two-dimensional space by projecting the embedded matrix through a modified nonlinear dimension reduction algorithm comprises:
according to the embedding matrix, adopting conditional probability measurement to measure the similarity between two nodes represented by the low-dimensional dense vector, and obtaining P distribution describing the node similarity;
measuring the proximity between two nodes in the two-dimensional space through Student-t distribution to obtain Q distribution for describing the node proximity;
measure the difference between the P distribution and the Q distribution by KL divergence;
constructing a second objective function according to the difference between the P distribution and the Q distribution and the edge repulsive force strategy of node centrality;
and optimizing the second objective function to obtain the coordinate value of the node in the graph data in the two-dimensional space.
5. The large-scale network-oriented graph layout method according to claim 4, wherein the node-centrality edge repulsion strategy adds an index measuring node centrality in front of the repulsion term of the objective function as a constraint; the node centrality index is the degree value of the node.
6. The large-scale network-oriented graph layout method according to claim 4, wherein the P distribution is:

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j\rVert^2 / 2\delta_i^2)}{\sum_{k\neq i}\exp(-\lVert x_i - x_k\rVert^2 / 2\delta_i^2)}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2|V|}$$

wherein ||x_i - x_j|| is the distance between the nodes x_i and x_j represented by low-dimensional dense vectors, and δ_i is the variance of the Gaussian distribution centered at node x_i;

the Q distribution is:

$$q_{ij} = \frac{(1+\lVert y_i - y_j\rVert^2)^{-1}}{\sum_{k\neq l}(1+\lVert y_k - y_l\rVert^2)^{-1}}$$

wherein ||y_i - y_j|| denotes the distance between the nodes y_i and y_j in the Euclidean layout space;

the KL divergence between the P and Q distributions is:

$$C_{KL} = KL(P \parallel Q) = \sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}}$$
7. The large-scale network-oriented graph layout method according to claim 4, wherein the second objective function is:

$$\underset{Y}{\arg\min}\ \lambda_{KL}C_{KL} + \frac{\lambda_c}{2|V|}\sum_i \lVert y_i\rVert^2 - \frac{\lambda_r}{2|V|^2}\sum_{i\neq j} M_{ij}\log(\lVert y_i - y_j\rVert + \epsilon_r)$$

wherein the first term λ_KL·C_KL is the KL divergence measuring the difference between the P and Q distributions, the second term (λ_c/2|V|)Σ_i ||y_i||² is the early compression term, the third term −(λ_r/2|V|²)Σ_{i≠j} M_{ij} log(||y_i − y_j|| + ε_r) is the repulsion term containing the weight constraint M_{ij}, ε_r is a parameter preventing singular points when ||y_i − y_j|| ≈ 0, and λ_KL, λ_c and λ_r are parameters weighing the importance of these three terms in the second objective function.
8. The large-scale network-oriented graph layout method according to claim 1, further comprising, after the step of projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space:

measuring the graph layout result according to a preset aesthetic index to obtain layout quality; the aesthetic index includes the number of edge crossings and the minimum angle.
9. A large-scale network-oriented graph layout apparatus, comprising:
the learning module is used for representing each node in graph data as a low-dimensional dense vector through a machine learning-based network embedded representation model, and constructing an embedded matrix of the graph data;
and the projection module is used for projecting the embedded matrix through an improved nonlinear dimension reduction algorithm to obtain a graph layout result of the graph data in a two-dimensional space.
10. The large-scale network-oriented graph layout apparatus of claim 9, further comprising:
the aesthetic measurement module is used for measuring the graph layout result according to a preset aesthetic index to obtain the layout quality; the aesthetic criteria include the number of edge crossings and the minimum angle.
CN202011384170.6A 2020-12-01 2020-12-01 Large-scale network-oriented graph layout method and device Active CN112417633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011384170.6A CN112417633B (en) 2020-12-01 2020-12-01 Large-scale network-oriented graph layout method and device


Publications (2)

Publication Number Publication Date
CN112417633A true CN112417633A (en) 2021-02-26
CN112417633B CN112417633B (en) 2022-06-14



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013242622A (en) * 2012-05-17 2013-12-05 Nippon Telegr & Teleph Corp <Ntt> Graph data visualization apparatus, method and program
CN109753589A (en) * 2018-11-28 2019-05-14 中国科学院信息工程研究所 A kind of figure method for visualizing based on figure convolutional network
CN110659394A (en) * 2019-08-02 2020-01-07 中国人民大学 Recommendation method based on two-way proximity
CN110889001A (en) * 2019-11-25 2020-03-17 浙江财经大学 Big image sampling visualization method based on image representation learning


Non-Patent Citations (4)

Title
MICHAEL FERRON; KEN Q. PU; JAROSLAW SZLICHTA: "ARC: A pipeline approach enabling large-scale graph visualization", 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
LIU Liangjun: "Graph Layout Algorithms for Large-Scale Complex Networks", China Master's Theses Full-text Database, Basic Sciences
ZHANG Yunyi: "Research on Degree-Biased Sampling Algorithms for Large-Scale Network Representation Learning", China Master's Theses Full-text Database, Basic Science and Technology
WEI Shichao et al.: "An E-t-SNE-based dimensionality reduction and visualization method for mixed-attribute data", Computer Engineering and Applications

Cited By (9)

Publication number Priority date Publication date Assignee Title
CN113127697A (en) * 2021-03-30 2021-07-16 清华大学 Method and system for optimizing graph layout, electronic device and readable storage medium
CN113127697B (en) * 2021-03-30 2022-11-15 清华大学 Method and system for optimizing graph layout, electronic device and readable storage medium
CN113158391A (en) * 2021-04-30 2021-07-23 中国人民解放军国防科技大学 Method, system, device and storage medium for visualizing multi-dimensional network node classification
CN113158391B (en) * 2021-04-30 2023-05-30 中国人民解放军国防科技大学 Visualization method, system, equipment and storage medium for multidimensional network node classification
WO2022251178A1 (en) * 2021-05-25 2022-12-01 Visa International Service Association Systems, methods, and computer program products for generating node embeddings
CN116171435A (en) * 2021-05-25 2023-05-26 维萨国际服务协会 System, method and computer program product for generating node embeddings
CN113536663A (en) * 2021-06-17 2021-10-22 山东大学 Graph visualization method and system based on ring constraint and stress model
CN113536663B (en) * 2021-06-17 2023-08-25 山东大学 Graph visualization method and system based on ring constraint and stress model
WO2024098195A1 (en) * 2022-11-07 2024-05-16 华为技术有限公司 Embedding representation management method and apparatus

Also Published As

Publication number Publication date
CN112417633B (en) 2022-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant