WO2018151619A1 - Network analysis tool testing

Info

Publication number: WO2018151619A1
Application number: PCT/RU2017/000085
Authority: WO
Inventors: Alexander Nikolaevich Filippov; Mikhail DROBYSHEVSKY; Anton KORSHUNOV; Ilya KOZLOV; Xuecang ZHANG
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2017-02-20
Filing date: 2017-02-20
Publication date: 2018-08-23
Also published as: CN110313150B; CN110313150A

Abstract

Provided is a process and a device for testing a functionality of a network analysis tool. The process involves receiving an input network dataset, the input network dataset defining a first graph comprising nodes and edges, wherein the edges represent connections between the nodes. The process further involves mapping the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs, determining, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, and determining edges connecting nodes of the second graph, based on the similarity function.

Description

NETWORK ANALYSIS TOOL TESTING

FIELD

The present disclosure relates to testing network analysis tools. In particular, the present disclosure relates to testing a functionality of a network analysis tool by analyzing artificially generated network data with the network analysis tool.

BACKGROUND

In order to test a network analysis tool such as a network mining tool, the network analysis tool may be fed with graphs representing different genuine or artificial networks and the analysis results may be evaluated to verify the functionality of the tool. In this regard, it may be beneficial to use a plurality of different graphs with similar properties to achieve statistically meaningful results.

SUMMARY

According to a first aspect of the present invention, there is provided a method for testing a functionality of a network analysis tool, the method comprising receiving an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes, mapping the nodes to a first set of vectors, wherein the mapping determines a similarity function assigning connection scores to vector pairs, determining, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, and determining edges connecting nodes of the second graph, based on the similarity function.

In this regard, it is noted that the term "network analysis tool" as used throughout the description and claims may equally refer to software, hardware, or a combination of software and hardware. For instance, the network analysis tool may be a combination of hardware and software, e.g., a computing device storing computer-readable instructions, which receives the input network dataset.

As the input network dataset describes a network topology, e.g., computing devices and communication links, the network topology may be analyzed to derive patterns that allow for an enhanced insight into the network topology. Furthermore, effects of changes in the network topology may be analyzed, e.g., by randomly or systematically modifying the network topology. For instance, an impact of different changes on the network topology may be simulated or critical changes may be extracted. Similarly, the network topology may be modified to increase the robustness of the network topology against adverse events such as node or communication link malfunctions.

However, it is clear to persons of skill in the art that communication networks are just one example of a network that could be analyzed using the network analysis tool. In principle, the network analysis tool may be used to analyze a wide range of different networks. For example, the network analysis tool may be used to analyze a transport network, one or more linked webpages, biological systems, the syntax of a (natural) language, a retail network, an advertising network, or a social network, and in fact any kind of network having a topology which can be described by a graph.

Moreover, the term "similarity function" as used throughout the description and claims in particular refers to a function that quantifies the similarity between nodes by a connection score, wherein a higher similarity, which may be represented by a higher connection score, may indicate a higher likelihood of an edge between the nodes and hence, connecting the nodes. For example, the connection score may be a real number, wherein higher numbers indicate a higher likelihood of the respective nodes being connected by an edge.

In a first possible implementation form of the first aspect, the method further comprises using the network analysis tool to analyze a network comprising the nodes and the edges of the second graph. Hence, the network analysis tool may receive the second graph as input and derive one or more patterns from the second graph. As the second graph is derived from the first graph, the second graph may differ in size from the first graph, e.g., the second graph may comprise less than half, less than one-fifth of, less than one-tenth of, less than one-hundredth of, etc., or more than two times, more than five times, more than ten times, more than one-hundred times, etc., the number of nodes of the first graph, but still exhibit the same or similar properties/patterns as the first graph. For example, an analysis result of the first graph and an analysis result of the second graph may be compared and if the results are not consistent with each other, the network analysis tool may be adapted/corrected or discarded. Optionally, further tests may be performed to analyze a statistical meaning and/or the basis of observed deviations.

In a second possible implementation form of the first aspect, the first graph is a directed graph and determining edges connecting nodes of the second graph comprises determining, for each ordered node pair of the second graph, whether an edge connects a first node of the node pair to a second node of the node pair, based on a first connection score, and whether the edge connects the second node of the node pair to the first node of the node pair, based on a second connection score.

Accordingly, the presented method can be used to generate graphs having directed connections such as, for example, graphs representing data traffic such as graphs representing a (wireless) communication network, the distribution of goods, etc.

In a third possible implementation form of the first aspect, all vectors of the second set of vectors are determined based on randomly or pseudo-randomly drawing vectors from a multidimensional probability distribution approximated from the first set of vectors.

For instance, a multidimensional probability distribution may be fitted to the first set of vectors. Accordingly, a structure of the original graph may be preserved while the artificial graphs may be up-scaled or down-scaled. Moreover, the original graph may not be recoverable from the artificial graphs, thereby allowing features of a network to be made public while keeping the detailed network structure private/confidential.

In a fourth possible implementation form of the first aspect, all vectors of the second set of vectors are determined by selecting vectors from the first set of vectors.

For example, the second set may comprise a subset of the vectors of the first set and/or the vectors of the first set may be duplicated to generate a down-scaled or up-scaled artificial graph having similar properties.

In a fifth possible implementation form of the first aspect, all vectors of the second set of vectors are determined by selecting a vector of the first set of vectors and adding a noise vector to the selected vector.

Hence, besides generating a down-scaled or up-scaled artificial graph, graph properties may be randomly modified to provide the artificial graph with similar yet randomly modified properties compared to the original graph, thereby allowing systematically testing the significance of the network analysis tool results.

In a sixth possible implementation form of the first aspect, the noise vector is randomly or pseudo-randomly drawn from a multidimensional Gaussian probability distribution.

Hence, a multitude of statistically similar yet different artificial graphs may be generated that provide for graph samples within a region around the original graph.

In a seventh possible implementation form of the first aspect, the nodes of the first graph are assigned to communities and a node of the second graph corresponding to a selected vector of the first set inherits a respective community assignment of the node corresponding to the selected vector.

The communities may, for example, represent sets of densely connected nodes, wherein the sets are more sparsely connected to each other than the nodes within each set. Hence, artificial graphs with communities having a similar (in a statistical sense) yet different structure compared to the original graph may be generated.

In an eighth possible implementation form of the first aspect, the edges of the first graph are assigned weights and an edge of the second graph connecting nodes corresponding to selected vectors of the first set inherits a respective weight of an edge of the first graph connecting the nodes corresponding to the selected vectors.

Hence, additional features of the network modelled by the graph may be encoded in the edge weights and at least partially preserved in the generated artificial graphs. For example, the weights may correspond to the bandwidth of a communication link, transport capacity, etc.

In a ninth possible implementation form of the first aspect, if the nodes corresponding to the selected vectors are not connected in the first graph, said edge of the second graph is assigned a minimal weight among all edges of the first graph.

Hence, a weight structure of the original graph may be maintained while generating artificial graphs having characteristics similar to those of the original graph.

In a tenth possible implementation form of the first aspect, determining edges connecting nodes of the second graph based on the similarity function further comprises comparing connection scores of pairs of nodes of the second graph with a threshold.

For example, edges between nodes of the second graph may be added, if the connection scores of the respective node pairs are above the threshold.

In an eleventh possible implementation form of the first aspect, the threshold is determined to discriminate, based on the similarity function, top-E node pairs of the first graph with relatively higher connection scores from the rest of node pairs, where E is a number of the edges in the first graph.

According to a second aspect of the present invention, there is provided a computer-readable medium, storing instructions which if executed by a computer cause the computer to load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes, map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs, determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, and determine edges connecting nodes of the second graph, based on the similarity function.

For example, the computer may be provided with a storage storing the instructions and the input network data set, or the computer may retrieve the input data set via a network connection. Moreover, the computer may be caused, by executing instructions stored on the computer-readable medium, to analyze network data and generate the input network dataset. For instance, the instructions may cause the computer to request data on computing entities and data connections between the computing entities of a network and to map the computing entities to nodes of the first graph and the data connections to edges of the first graph.

In a first possible implementation form of the second aspect, the computer-readable medium further stores instructions which if executed by the computer cause the computer to execute a network analysis tool and analyze a network comprising the nodes and the edges of the second graph. Hence, a modified, e.g., down-scaled or up-scaled second graph can be derived from the first graph, wherein the derived graph has properties similar to those of the first graph. Thus, a comparison between the results of an analysis of a network corresponding to the first graph and networks corresponding to derived second graphs can be used to verify that the network analysis tool derives similar patterns when analyzing networks having similar properties.

According to a third aspect of the present invention, there is provided a network analysis tool testing apparatus, comprising a processor and persistently stored instructions which, if executed by the processor, cause the processor to load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes, map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs, determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, determine edges connecting nodes of the second graph, based on the similarity function, and store an output network dataset, the output network dataset defining the second graph.

For instance, the apparatus may implement the method according to the first aspect and the implementation forms of the first aspect and use the second graph to test the network analysis tool.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 shows a flow-chart of a process for generating an output graph from an input graph;

Fig. 2 shows exemplary input and output graphs used/generated by the process of Fig. 1;

Fig. 3 illustrates the application of the process of Fig. 1 for use in relation to a network mining tool; and

Fig. 4 shows a block diagram of a network mining tool testing apparatus.

DETAILED DESCRIPTION

Fig. 1 and Fig. 2 illustrate a process 10 for generating an output graph 12 from an input graph 14. As indicated at step 16, the process 10 starts with receiving an input network dataset defining the input graph 14. In this regard, the following notation is used in the remainder:

For instance, the input graph 14 illustrated in Fig. 2 may be a directed weighted graph G = (N, E) with a community structure, as indicated by the circles around nodes (0,1,2,3) and (3,4,5,6,7), respectively. Thus, each node n_i may have an assigned community label c_i. However, it is to be noted that, depending on the graph, no community label or a set of community labels C_i may be assigned to a node. Moreover, each edge n_i -> n_j may have an assigned weight w_{ij}.

As indicated at step 18, the input graph 14 is mapped to vectors. In particular, the graph G = (N,E) may be embedded by mapping the nodes to real-valued vectors. For example, the directed weighted graph G may be embedded based on a bilinear link model, BLM, or using large-scale information network embedding, LINE, although other techniques such as DeepWalk (cf. B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online Learning of Social Representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014) or node2vec (cf. A. Grover and J. Leskovec, "node2vec: Scalable Feature Learning for Networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016) may be used instead.

For instance, a BLM designed specifically for directed graphs has been suggested by O. U. Ivanov and S. O. Bartunov in "Learning Representations in Directed Networks," presented at the International Conference on Analysis of Images, Social Networks and Texts, and published on pages 196-207 in Volume 542 of the series "Communications in Computer and Information Science" by Springer in 2015, and is given by:

Here, the parameters are associated with input and output node representations, hence allowing the embedding of a directed graph. Having regard to the joint link probability, the objective function of the embedding is a log-likelihood of the whole graph:
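The formulas of the link model and of its objective are not reproduced in this text. By way of a non-limiting illustration, the following minimal sketch assumes that the probability of a link from node i to node j is a softmax over inner products of an input vector of node i and the output vectors of all nodes, and that the objective sums the logarithms of these probabilities over the edges. The names `in_vecs`, `out_vecs`, `link_probability` and `graph_log_likelihood` as well as the toy values are hypothetical, and the cited BLM may differ in detail.

```python
import math

def link_probability(in_vecs, out_vecs, i, j):
    """Assumed softmax probability of a link i -> j over all candidate targets."""
    scores = [sum(a * b for a, b in zip(in_vecs[i], out_k)) for out_k in out_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[j] / sum(exps)

def graph_log_likelihood(in_vecs, out_vecs, edges):
    """Sum of the log link probabilities over all directed edges."""
    return sum(math.log(link_probability(in_vecs, out_vecs, i, j)) for i, j in edges)

# Hypothetical two-dimensional vectors for a three-node toy graph.
in_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_vecs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
edges = [(0, 1), (1, 0), (2, 0)]

row = [link_probability(in_vecs, out_vecs, 0, j) for j in range(3)]
ll = graph_log_likelihood(in_vecs, out_vecs, edges)
```

The denominator of the softmax runs over all nodes, which is the summation cost that the approximation techniques discussed in the following are intended to avoid.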

For the softmax approximation, a technique called noise contrastive estimation, NCE, presented by M. Gutmann and A. Hyvarinen in "Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models," in AISTATS, Volume 1, 2010, page 6, may be used. This technique is directed at the estimation of unnormalized probabilistic models, treating the normalizing constant as an additional parameter. The key idea is to reduce the task of probability density learning to a binary classification problem, namely, distinguishing the data distribution p_d(x) from a noise distribution p_n(x). Applied to graph embedding and assuming that noise samples are k times more frequent than data samples, the mixture distribution takes the form:

Moreover, the posterior probability that a sample x is from the data distribution is:

If a model p_θ(x) with parameter set θ aims at fitting the data distribution, the posterior probability becomes a function of θ:

Logistic regression may then be used to optimize the log-likelihood of the data against the noise:
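The expressions referred to in this passage are not reproduced in this text. For orientation, the standard NCE formulation may be written as follows, using p_d, p_n, p_θ and k as introduced above; the class label D (with D = 1 denoting a data sample) and the symbols p_mix and J are notational assumptions made for this sketch only.

```latex
% Mixture of data and noise, with noise samples k times more frequent
p_{\mathrm{mix}}(x) = \frac{1}{k+1}\, p_d(x) + \frac{k}{k+1}\, p_n(x)

% Posterior probability that a sample x stems from the data distribution
P(D = 1 \mid x) = \frac{p_d(x)}{p_d(x) + k\, p_n(x)}

% Posterior with the model p_\theta in place of p_d
P(D = 1 \mid x; \theta) = \frac{p_\theta(x)}{p_\theta(x) + k\, p_n(x)}

% Log-likelihood of the binary classification (data against noise)
J(\theta) = \mathbb{E}_{x \sim p_d}\!\left[ \log P(D = 1 \mid x; \theta) \right]
          + k\, \mathbb{E}_{x \sim p_n}\!\left[ \log\bigl( 1 - P(D = 1 \mid x; \theta) \bigr) \right]
```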

This approximates p_d(x) without a normalization requirement in regard to p_θ(x). In the BLM setup, the normalization constant becomes a new parameter, resulting in a new parameter set and the following probabilistic model:

Taking into account that p_d corresponds to the graph edges, and for a suitably chosen noise distribution p_n, the new objective becomes:

Thus, the initial objective in BLM may be replaced with the NCE objective which can be efficiently optimized.

The LINE approach for embedding a directed weighted graph suggested by J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, in "LINE: Large-scale Information Network Embedding," published in the Proceedings of the 24th International Conference on World Wide Web, ACM, 2015, pages 1067-1077, is based on so-called first-order and second-order proximities between nodes. The first-order proximity of the nodes n_i and n_j is indicated by the edge weight w_{ij} and characterizes the strength of the relation of the nodes:

The second-order proximity characterizes the relationship of a node n_j with its context n_i:

This in fact coincides with the softmax approximation in the BLM.

Furthermore, it is suggested to optimize the two corresponding objectives separately:
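The proximity models and objectives are not reproduced in this text. As generally given in the cited LINE publication, and rewritten here with the node and weight symbols used above, they may be sketched as follows. Here u_i denotes the embedding vector of node n_i, while u'_j is assumed to denote a second, auxiliary vector of node n_j that is used only in the second-order model; the labels p_1, p_2, O_1 and O_2 follow the cited publication and are not taken from this description.

```latex
% First-order proximity model and objective
p_1(n_i, n_j) = \frac{1}{1 + \exp\!\left(-u_i^{\top} u_j\right)}
\qquad
O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(n_i, n_j)

% Second-order proximity model and objective
p_2(n_j \mid n_i) = \frac{\exp\!\left(u_j'^{\top} u_i\right)}{\sum_{k=1}^{|N|} \exp\!\left(u_k'^{\top} u_i\right)}
\qquad
O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(n_j \mid n_i)
```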

The embedding vectors u_i of both models may then be concatenated.

In order to reduce the summation complexity of the denominator, negative sampling, NEG, a technique known from language modelling and described by T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, in "Distributed Representations of Words and Phrases and their Compositionality," published in the Proceedings of the Advances in Neural Information Processing Systems, 2013, pages 3111-3119, may be used. NEG is a simplification of NCE which does not approximate the softmax but nevertheless retains the quality of the embedding vectors. This is achieved by replacing the term kp_n(x) with 1 and ignoring the normalizing constant which results in:

Substituting this into the expression for the objective yields the NEG objective. For a single edge in LINE, this has the form:
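The single-edge expression is not reproduced in this text. As a hedged illustration of the usual negative-sampling form, the following sketch adds the logarithm of a sigmoid of the inner product for the observed edge to the logarithms of sigmoids of the negated inner products for a number of noise nodes. The expectation over the noise distribution is replaced by a sum over noise vectors that are assumed to have been drawn already; all names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_edge_objective(u_i, u_j_aux, noise_aux):
    """log sigma(u'_j . u_i) plus the sum of log sigma(-u'_n . u_i) over the noise vectors."""
    positive = math.log(sigmoid(dot(u_j_aux, u_i)))
    negative = sum(math.log(sigmoid(-dot(u_n, u_i))) for u_n in noise_aux)
    return positive + negative

# Hypothetical vectors: one observed edge and two noise nodes.
u_i = [0.5, -0.2]
u_j_aux = [0.4, 0.1]
noise_aux = [[-0.3, 0.2], [0.1, -0.6]]
value = neg_edge_objective(u_i, u_j_aux, noise_aux)
```

The cost per edge grows with the number of noise vectors only, and not with the number of nodes of the graph.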

However, it is noted that the above similarity function is just one example, and other similarity functions may be used instead.

Continuing with the example of Fig. 2, the input graph 14 G has |N| = 8 nodes and |E| = 28 edges, wherein two edges, (0,5) and (2,6), have weight 0.1 and all other edges have weight 10. As indicated above, the overlapping communities of the input graph 14 G are a = (0,1,2,3) and b = (3,4,5,6,7) with modularity Q_G = 0.2955 (cf. M. Drobyshevskiy, A. Korshunov, and D. Turdakov, "Parallel Modularity Computation for Directed Weighted Graphs with Overlapping Communities," in Proceedings of the Institute for System Programming, Volume 28(6), 2016, pages 153-170). The input graph 14 G = (N, E) with N = {0,1,2,3,4,5,6,7}, E = {(0,1), (1,0), (0,2),..., (7,6)}, weight labels {w₀₁ = 10, w₁₀ = 10,..., w₀₅ = 0.1,..., w₇₆ = 10}, and community labels may be embedded using BLM with the following parameters:

• epochs=600: number of epochs,

This may result in two vectors and a normalization parameter Z_i per node:

The node pairs may then be ranked by their similarity score s_{ij}. If the embedding is successful, (almost) all edges should have higher scores than non-edge pairs, and a threshold t_G in s_{ij} may be set which has rank |E|. Said threshold t_G (approximately) separates edges from non-edges.
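The choice of the threshold may be sketched as follows: all node pairs are sorted by descending score and the score found at rank |E| is taken as the threshold. The function name `rank_threshold` and the toy scores are hypothetical and merely serve to illustrate the ranking step.

```python
def rank_threshold(scores, num_edges):
    """Return the score at rank num_edges when all pairs are sorted by descending score."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[num_edges - 1]

# Hypothetical similarity scores for the ordered node pairs of a three-node graph.
scores = {(0, 1): 0.92, (1, 0): 0.88, (1, 2): 0.75,
          (2, 1): 0.31, (0, 2): 0.12, (2, 0): 0.05}
edges = {(0, 1), (1, 0), (1, 2)}

t_G = rank_threshold(scores, len(edges))
# Pairs scoring at or above the threshold; equal to the edge set if the embedding is successful.
recovered = {pair for pair, s in scores.items() if s >= t_G}
```

In this toy case the three edges occupy the three highest ranks, so that the pairs at or above the threshold coincide with the edge set; for a less successful embedding, some non-edge pairs would be found above the threshold.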

Then, the vectors may be concatenated into a corresponding embedding vector r_i for each node.

Continuing with the above example, ranking all node pairs of N × N using the BLM similarity score may result in the following ordered list:

The threshold t_G in s_{ij} having rank |E| may thus be computed as t_G = 0.639097846078. Moreover, concatenating the vectors into an embedding vector leads to:

As indicated at step 20, the embedding vectors may be used to determine the output graph 12. For example, in order to approximate the distribution of the embedding vectors when sampling/drawing the vectors representing the output graph 12:

• the vectors may be randomly sampled,

• (Gaussian) noise may be added to the vectors and the resulting vectors may be randomly sampled, or

• a multi-dimensional probability distribution may be fitted to the vectors and the vectors representing the output graph 12 may be randomly drawn from said distribution.
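The three alternatives listed above may be sketched as follows. The sketch is a simplification under stated assumptions: the noise is drawn independently per component with a common standard deviation `sigma`, and the fitted distribution is an axis-aligned Gaussian described by per-component means and standard deviations. All function names and values are hypothetical.

```python
import random

def sample_with_repetition(vectors, m, rng):
    """Pick m vectors uniformly at random, with repetition."""
    return [list(rng.choice(vectors)) for _ in range(m)]

def sample_with_noise(vectors, m, sigma, rng):
    """Pick m vectors with repetition and add Gaussian noise to every component."""
    return [[x + rng.gauss(0.0, sigma) for x in rng.choice(vectors)] for _ in range(m)]

def sample_from_fitted_gaussian(vectors, m, rng):
    """Fit per-component means and standard deviations, then draw m new vectors."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    stds = [(sum((v[d] - means[d]) ** 2 for v in vectors) / n) ** 0.5 for d in range(dim)]
    return [[rng.gauss(means[d], stds[d]) for d in range(dim)] for _ in range(m)]

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
embedding = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]

picked = sample_with_repetition(embedding, 8, rng)
noisy = sample_with_noise(embedding, 8, 0.05, rng)
drawn = sample_from_fitted_gaussian(embedding, 8, rng)
```

An axis-aligned Gaussian disregards correlations between components; a distribution that captures such correlations, e.g., a mixture model, may be closer to what is needed to preserve a community structure.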

As the number of vectors representing the output graph 12 corresponds to the size m = |M| of the output graph 12 H = (M, F), 16 vectors may be sampled/drawn in relation to the example of Fig. 2. For example, m vectors may be randomly picked (with repetitions) from the set of embedding vectors. E.g., for each j = 1..m, i may be drawn uniformly from [1, |N|] and an assignment q_j = r_i may be made. As the vectors q_j correspond to the nodes of the output graph 12, the output graph 12 (with edges yet to be determined) may be defined as H = (M,-). To determine the edges, each vector q_j may then be de-concatenated into 2 vectors of equal length.

Continuing with the above example, |M| = 16 vectors may be selected from the set of vectors. The selected vectors may then be assigned to the nodes of the output graph 12 H, with M comprising the 16 nodes of the output graph 12 H. For each node, the vector may be de-concatenated into 2 vectors of equal length.

In this regard, it should however be noted that instead of concatenating and de-concatenating vectors, the selecting and assigning may also be performed on the basis of vector sets, where each set contains a first vector in relation to outgoing edges and a second vector in relation to ingoing edges.

As indicated at step 22, edges between nodes of the output graph 12 H = (M,-) may then be determined based on the similarity function by connecting the top |F| pairs of nodes ranked by the similarity function. For instance, the similarity score z_{ij} may be computed for all pairs as a softmax. For pairs (m_i, m_j) ∈ M × M with score z_{ij} > t_G, node m_i may be connected to node m_j with an edge. The output graph 12 H = (M, F) may thus have a set of directed edges F consisting of all such pairs.
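The edge determination may be sketched as follows. The sketch assumes that the first half of each concatenated vector relates to outgoing edges and the second half to incoming edges, and that the score of an ordered pair is a softmax, over all possible targets, of the inner product of the respective halves. An explicitly normalized softmax is used here in place of a learned normalization parameter; all names and values are hypothetical.

```python
import math

def split_vector(q):
    """De-concatenate an embedding vector into two halves of equal length."""
    half = len(q) // 2
    return q[:half], q[half:]

def softmax_scores(vectors):
    """Score every ordered pair (i, j), i != j, by a softmax over the targets j of node i."""
    halves = [split_vector(q) for q in vectors]
    scores = {}
    for i, (src, _) in enumerate(halves):
        raw = {j: sum(a * b for a, b in zip(src, dst))
               for j, (_, dst) in enumerate(halves) if j != i}
        top = max(raw.values())  # subtract the maximum for numerical stability
        total = sum(math.exp(s - top) for s in raw.values())
        for j, s in raw.items():
            scores[(i, j)] = math.exp(s - top) / total
    return scores

def directed_edges(vectors, threshold):
    """Connect i -> j whenever the score of the ordered pair exceeds the threshold."""
    return {pair for pair, z in softmax_scores(vectors).items() if z > threshold}

# Hypothetical concatenated vectors for three output nodes.
q = [[2.0, 0.0, 0.0, 2.0], [0.0, 2.0, 2.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
F = directed_edges(q, 0.6)
```

As each ordered pair is scored separately, an edge from node 2 to node 0 is obtained in this toy case without an edge from node 0 to node 2, which reflects the directed nature of the graph.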

Continuing with the above example, the similarity scores z_{ij} may be:

For pairs with score z_{ij} > t_G, node m_i may be connected to node m_j.

As a result, the edges F = {(0,2), (0,4),..., (15,14)} between the nodes m_i and m_j of the output graph 12 may be determined.

Moreover, weights may be assigned to each of the |F| edges of the output graph 12 H by inheriting the edge weights of the corresponding edges of the input graph 14 G, if the input graph 14 G has different edge weights. Furthermore, for edges of the output graph 12 H which have no corresponding edges in the input graph 14 G, a minimal edge weight may be assigned, e.g., a minimal weight of all edges of the input graph 14 G. I.e., for each edge (k,l) ∈ F, the corresponding edge weight may be determined by inheriting the weight w_{ij} if the nodes n_i and n_j, which the corresponding node vectors q_k and q_l are sampled from, are connected by an edge in the input graph 14 G; otherwise, a minimal weight may be assigned.
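The weight assignment may be sketched as follows, assuming that a mapping from each output node to the input node its vector was sampled from has been recorded during sampling. The names `inherit_weights` and `source_of` and the toy data are hypothetical.

```python
def inherit_weights(out_edges, source_of, in_weights):
    """Give every output edge the weight of the input edge between the sampled source
    nodes, or the minimal input weight if the source nodes are not connected."""
    min_weight = min(in_weights.values())
    return {(k, l): in_weights.get((source_of[k], source_of[l]), min_weight)
            for k, l in out_edges}

# Hypothetical data: output node k was sampled from input node source_of[k].
in_weights = {(0, 1): 10.0, (1, 0): 10.0, (0, 2): 0.1}
source_of = {0: 0, 1: 1, 2: 1, 3: 2}
out_edges = {(0, 1), (0, 3), (1, 2), (3, 0)}

out_weights = inherit_weights(out_edges, source_of, in_weights)
```

In this toy case, the edge (1, 2) connects two output nodes that were both sampled from input node 1, for which no input edge exists, so that the minimal weight is assigned.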

Continuing with the above example:

Moreover, community labels may be assigned to each of the |M| nodes of graph H by inheriting corresponding community labels of the input graph 14 G, if the input graph 14 G has a community structure, i.e., community labels C_i. I.e., for each node m_j ∈ M, its community label set C'_j may be determined by C'_j = C_i, if the corresponding node vector q_j is sampled from node vector r_i.
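The label assignment may be sketched in the same manner, again assuming a recorded mapping from output nodes to the input nodes they were sampled from. The names `inherit_communities`, `members` and `source_of` and the toy labels are hypothetical.

```python
def inherit_communities(source_of, in_labels):
    """Copy the community label set of each sampled input node to its output node."""
    return {k: set(in_labels.get(i, set())) for k, i in source_of.items()}

def members(labels, community):
    """List the nodes carrying a given community label."""
    return sorted(node for node, labs in labels.items() if community in labs)

# Hypothetical input labels; node 3 belongs to both communities.
in_labels = {0: {"a"}, 1: {"a"}, 2: {"a"}, 3: {"a", "b"}, 4: {"b"}}
source_of = {0: 3, 1: 4, 2: 0, 3: 3}

out_labels = inherit_communities(source_of, in_labels)
community_a = members(out_labels, "a")
community_b = members(out_labels, "b")
```

As input node 3 is sampled twice in this toy case, two output nodes belong to both communities, i.e., the overlap of the communities is carried over to the output graph.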

Continuing with the above example:

As a result, the output graph 12 H = (M, F) with |M| = 16 nodes and |F| = 99 edges and two overlapping communities a = (0,5,6,14,15) and b = (1,2,3,4,7,8,9,10,11,12,13,14) is determined, wherein the edges within the communities have high weight while the edges between the communities have low weight. Moreover, the output graph 12 H = (M, F) has a similar degree distribution and distribution of subgraphs with 3 nodes as the input graph 14 G = (N, E) and a relatively high modularity (Q_H = 0.2374) in view of the communities.

In summary, the above process 10 of generating random output graphs 12 which have similar properties as a given input graph 14 provides the following benefits:

• automatic learning of degree distribution, subgraph distribution, and community structure from a given graph and reproducing them in synthetic graphs,

• enabling synthetic graphs of arbitrary size, and

• supporting directed weighted graphs with communities.

However, it is clear to the skilled person that the above process 10 is not limited to directed weighted graphs with communities but that the process 10 may also be applied to graphs which are not directed, graphs which have edges without weight, and/or graphs without community structure. Moreover, the (directed) (weighted) input graph 14 (with communities) may - in principle - be from any graph domain (social, mobile, biological, etc.).

As shown in Fig. 3, the output graph 12 may be used for the development and significance testing of network mining tools, e.g., in view of community detection. Furthermore, since the output graph 12 can be made arbitrarily large, the scalability of network mining tools can be evaluated by testing a network mining tool with multiple output graphs 12 which are all generated from the same input graph 14 but differ in size. For instance, the network mining tool may be tested with output graphs 12 having half, one-fifth of, one-tenth of, one-hundredth of, etc., and/or two times, five times, ten times, one-hundred times, etc., the number of nodes of the input graph 14. The analysis results gained by analyzing such output graphs 12 may be compared and if the results are consistent with each other (for a sufficiently large number of output graphs 12), scalability of the network mining tool may be verified.

Moreover, data anonymization can be achieved, which makes it possible to make network features public without making the exact structure of the network public. Finally, if a large network is difficult to analyze due to its size, the process 10 may be applied to create a representative sample, i.e., an output graph 12 of smaller size, of such a network with similar properties.

Fig. 4 shows a block diagram of a network mining tool testing apparatus 24. The apparatus 24 comprises a processor 26 and a computer-readable medium 28 persistently storing instructions which if executed by the processor 26 implement some or all steps of the above-described process 10.

Claims

1. A method for testing a functionality of a network analysis tool, the method comprising: receiving an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes; mapping the nodes to a first set of vectors, wherein the mapping determines a similarity function assigning connection scores to vector pairs; determining, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph; and determining edges connecting nodes of the second graph, based on the similarity function.

2. The method of claim 1, further comprising: using the network analysis tool to analyze a network comprising the nodes and the edges of the second graph.

3. The method of claim 1 or 2, wherein the first graph is a directed graph and determining edges connecting nodes of the second graph comprises determining, for each ordered node pair of the second graph: whether an edge connects a first node of the node pair to a second node of the node pair based on a first connection score; and whether the edge connects the second node of the node pair to the first node of the node pair based on a second connection score.

4. The method of any one of claims 1 to 3, wherein all vectors of the second set of vectors are determined based on randomly or pseudo-randomly drawing vectors from a multidimensional probability distribution approximated from the first set of vectors.

5. The method of claim 4, wherein all vectors of the second set of vectors are determined by selecting vectors from the first set of vectors.

6. The method of claim 4, wherein all vectors of the second set of vectors are determined by selecting vectors from the first set of vectors and adding a noise vector to the selected vector.

7. The method of claim 6, wherein the noise vector is randomly or pseudo-randomly drawn from a multidimensional Gaussian probability distribution.

8. The method of any one of claims 4 to 7, wherein the nodes of the first graph are assigned to communities and a node of the second graph corresponding to a selected vector of the first set inherits a respective community assignment of the node corresponding to the selected vector.

9. The method of any one of claims 4 to 8, wherein the edges of the first graph are assigned weights and an edge of the second graph connecting nodes corresponding to selected vectors of the first set inherits a respective weight of an edge of the first graph connecting the nodes corresponding to the selected vectors.

10. The method of claim 9, wherein if the nodes corresponding to the selected vectors are not connected in the first graph, said edge of the second graph is assigned a minimal weight among all edges of the first graph.

11. The method of any one of claims 1 to 10, wherein determining edges connecting nodes of the second graph based on the similarity function further comprises comparing connection scores of pairs of nodes of the second graph with a threshold.

12. The method of claim 11, wherein the threshold is determined to discriminate, based on the similarity function, top-E node pairs of the first graph with relatively higher connection scores from the rest of node pairs, where E is a number of the edges in the first graph.

13. A computer-readable medium, storing instructions which if executed by a computer cause the computer to: load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes; map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs; determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph; and determine edges connecting nodes of the second graph, based on the similarity function.

14. The computer-readable medium of claim 13, further storing instructions which if executed by the computer cause the computer to: execute a network analysis tool; and analyze a network comprising the nodes and the edges of the second graph.

15. A network analysis tool testing apparatus, comprising: a processor; and persistently stored instructions which if executed by the processor cause the processor to: load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes; map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs; determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph; determine edges connecting nodes of the second graph, based on the similarity function; and store an output network dataset, the output network dataset defining the second graph.