WO2018151619A1 - Network analysis tool testing - Google Patents

Network analysis tool testing

Publication number
WO2018151619A1
WO2018151619A1 PCT/RU2017/000085 RU2017000085W WO2018151619A1 WO 2018151619 A1 WO2018151619 A1 WO 2018151619A1 RU 2017000085 W RU2017000085 W RU 2017000085W WO 2018151619 A1 WO2018151619 A1 WO 2018151619A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph
vectors
nodes
edges
node
Prior art date
Application number
PCT/RU2017/000085
Other languages
French (fr)
Inventor
Alexander Nikolaevich Filippov
Mikhail DROBYSHEVSKY
Anton KORSHUNOV
Ilya KOZLOV
Xuecang ZHANG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/RU2017/000085 (WO2018151619A1)
Priority to CN201780086994.5A (CN110313150B)
Publication of WO2018151619A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided is a process and a device for testing a functionality of a network analysis tool. The process involves receiving an input network dataset, the input network dataset defining a first graph comprising nodes and edges, wherein the edges represent connections between the nodes. The process further involves mapping the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs, determining, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, and determining edges connecting nodes of the second graph, based on the similarity function.

Description

NETWORK ANALYSIS TOOL TESTING
FIELD
The present disclosure relates to testing network analysis tools. In particular, the present disclosure relates to testing a functionality of a network analysis tool by analyzing artificially generated network data with the network analysis tool.
BACKGROUND
In order to test a network analysis tool such as a network mining tool, the network analysis tool may be fed with graphs representing different genuine or artificial networks and the analysis results may be evaluated to verify the functionality of the tool. In this regard, it may be beneficial to use a plurality of different graphs with similar properties to achieve statistically meaningful results.
SUMMARY
According to a first aspect of the present invention, there is provided a method for testing a functionality of a network analysis tool, the method comprising receiving an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes, mapping the nodes to a first set of vectors, wherein the mapping determines a similarity function assigning connection scores to vector pairs, determining, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, and determining edges connecting nodes of the second graph, based on the similarity function.
In this regard, it is noted that the term "network analysis tool" as used throughout the description and claims may equally refer to software, hardware, or a combination of software and hardware. For instance, the network analysis tool may be a combination of hardware and software, e.g., a computing device storing computer-readable instructions, which receives the input network dataset.
As the input network dataset describes a network topology, e.g., computing devices and communication links, the network topology may be analyzed to derive patterns that allow for an enhanced insight into the network topology. Furthermore, effects of changes in the network topology may be analyzed, e.g., by randomly or systematically modifying the network topology. For instance, an impact of different changes on the network topology may be simulated or critical changes may be extracted. Similarly, the network topology may be modified to increase robustness of the network topologies to adverse events such as node or communication link malfunctions.
However, it is clear to persons of skill in the art that communication networks are just one example of a network that could be analyzed using the network analysis tool because, in principle, the network analysis tool may be used to analyze a wide range of different networks. For example, the network analysis tool may be used to analyze a transport network, one or more linked webpages, biological systems, the syntax of a (natural) language, a retail network, an advertising network, or a social network, and in fact any kind of network having a topology which can be described by a graph.
Moreover, the term "similarity function" as used throughout the description and claims in particular refers to a function that quantifies the similarity between nodes by a connection score, wherein a higher similarity, which may be represented by a higher connection score, may indicate a higher likeliness of an edge between the nodes and hence, connecting the nodes. For example, the connection score may be a real number, wherein higher numbers indicate a higher likeliness of the respective nodes being connected by an edge.
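As a minimal sketch of such a similarity function, the following example maps a pair of node vectors to a real-valued connection score. The sigmoid of a dot product is an assumed, illustrative choice and not necessarily the function produced by the embedding described later; the name `connection_score` and the sample vectors are hypothetical.

```python
import math

def connection_score(u, v):
    """Map a vector pair to a score in (0, 1); a higher score means an edge is more likely."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-dot))

# Hypothetical 2-dimensional vectors for three nodes (illustration only).
vectors = {"a": [1.0, 2.0], "b": [1.5, 1.0], "c": [-2.0, -1.0]}

score_ab = connection_score(vectors["a"], vectors["b"])  # similar vectors
score_ac = connection_score(vectors["a"], vectors["c"])  # dissimilar vectors
```

Under this assumption, the pair (a, b) receives a higher connection score than the pair (a, c), so an edge between a and b would be considered more likely.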
In a first possible implementation form of the first aspect, the method further comprises using the network analysis tool to analyze a network comprising the nodes and the edges of the second graph. Hence, the network analysis tool may receive the second graph as input and derive one or more patterns from the second graph. As the second graph is derived from the first graph, the second graph may differ in size from the first graph, e.g., the second graph may comprise less than half, less than one-fifth of, less than one-tenth of, less than one-hundredth of, etc., or more than two times, more than five times, more than ten times, more than one-hundred times, etc., the number of nodes of the first graph, but still exhibit the same or similar properties/patterns as the first graph. For example, an analysis result of the first graph and an analysis result of the second graph may be compared and if the results are not consistent with each other, the network analysis tool may be adapted/corrected or discarded. Optionally, further tests may be performed to analyze a statistical meaning and/or the basis of observed deviations.
In a second possible implementation form of the first aspect, the first graph is a directed graph and determining edges connecting nodes of the second graph comprises determining, for each ordered node pair of the second graph, whether an edge connects a first node of the node pair to a second node of the node pair, based on a first connection score, and whether the edge connects the second node of the node pair to the first node of the node pair, based on a second connection score.
Accordingly, the presented method can be used to generate graphs having directed connections such as, for example, graphs representing data traffic such as graphs representing a (wireless) communication network, the distribution of goods, etc.
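The per-direction decision can be sketched as follows, assuming that every node carries one vector for outgoing and one for incoming edges, that the connection score of an ordered pair is the dot product of the source's outgoing vector and the target's incoming vector, and that self-loops are skipped. These choices, the function names, and the sample data are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def directed_edges(out_vecs, in_vecs, threshold):
    """Return every ordered pair (i, j), i != j, whose score exceeds the threshold.

    The score for i -> j uses out_vecs[i] and in_vecs[j], so the scores for
    i -> j and j -> i are computed (and decided) independently.
    """
    n = len(out_vecs)
    return [(i, j)
            for i in range(n)
            for j in range(n)
            if i != j and dot(out_vecs[i], in_vecs[j]) > threshold]

# Hypothetical vectors for three nodes.
out_vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
in_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = directed_edges(out_vecs, in_vecs, threshold=0.5)
```

With these sample vectors, node 0 is connected to node 1 but not vice versa, which is the asymmetry a directed graph requires.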
In a third possible implementation form of the first aspect, all vectors of the second set of vectors are determined based on randomly or pseudo-randomly drawing vectors from a multidimensional probability distribution approximated from the first set of vectors.
For instance, a multidimensional probability distribution may be fitted to the first set of vectors. Accordingly, a structure of the original graph may be preserved while the artificial graphs may be up-scaled or down-scaled. Moreover, the original graph may not be recoverable from the artificial graphs, thereby allowing features of a network to be made public while keeping the detailed network structure private/confidential.
In a fourth possible implementation form of the first aspect, all vectors of the second set of vectors are determined by selecting vectors from the first set of vectors.
For example, the second set may comprise a subset of the vectors of the first set and/or the vectors of the first set may be duplicated to generate a down-scaled or up-scaled artificial graph having similar properties.
In a fifth possible implementation form of the first aspect, all vectors of the second set of vectors are determined by selecting a vector of the first set of vectors and adding a noise vector to the selected vector.
Hence, besides generating a down-scaled or up-scaled artificial graph, graph properties may be randomly modified to provide the artificial graph with similar yet randomly modified properties compared to the original graph, thereby allowing systematic testing of the significance of the network analysis tool results.
In a sixth possible implementation form of the first aspect, the noise vector is randomly or pseudo-randomly drawn from a multidimensional Gaussian probability distribution.
Hence, a multitude of statistically similar yet different artificial graphs may be generated that provide for graph samples within a region around the original graph.
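A minimal sketch of this selection-plus-noise step is given below. It assumes an isotropic Gaussian (independent noise of equal standard deviation `sigma` in every dimension) and selection with repetition; the function name `sample_with_noise` and all values are hypothetical.

```python
import random

def sample_with_noise(vectors, m, sigma, rng):
    """Pick m vectors (with repetition) and add independent Gaussian noise to each entry."""
    sampled = []
    for _ in range(m):
        base = rng.choice(vectors)
        sampled.append([x + rng.gauss(0.0, sigma) for x in base])
    return sampled

rng = random.Random(0)  # seeded so that the pseudo-random draw is reproducible
first_set = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
second_set = sample_with_noise(first_set, m=6, sigma=0.1, rng=rng)
```

Setting `sigma` to zero reduces this to plain selection, while larger values move the generated vectors further away from the original ones.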
In a seventh possible implementation form of the first aspect, the nodes of the first graph are assigned to communities and a node of the second graph corresponding to a selected vector of the first set inherits a respective community assignment of the node corresponding to the selected vector.
The communities may, for example, represent sets of densely connected nodes, wherein the nodes of a set are more sparsely connected to the rest of the network than to each other. Hence, artificial graphs with communities having a similar (in a statistical sense) yet different structure compared to the original graph may be generated.
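The inheritance of community assignments can be sketched as follows, assuming that each node of the first graph has a set of community labels and that vectors are selected with repetition. The function name and the sample data are hypothetical.

```python
import random

def sample_nodes_with_communities(vectors, communities, m, rng):
    """Draw m source indices with repetition; each new node copies the vector
    and the community label set of the node it was selected from."""
    new_vectors, new_communities, sources = [], [], []
    for _ in range(m):
        i = rng.randrange(len(vectors))
        sources.append(i)
        new_vectors.append(list(vectors[i]))
        new_communities.append(set(communities[i]))
    return new_vectors, new_communities, sources

rng = random.Random(3)
vectors = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
communities = [{"a"}, {"a"}, {"a", "b"}, {"b"}]  # node 2 belongs to both communities
new_vectors, new_communities, sources = sample_nodes_with_communities(
    vectors, communities, 8, rng)
```

Keeping the list of source indices makes it possible to trace every node of the second graph back to the node of the first graph it was derived from.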
In an eighth possible implementation form of the first aspect, the edges of the first graph are assigned weights and an edge of the second graph connecting nodes corresponding to selected vectors of the first set inherits a respective weight of an edge of the first graph connecting the nodes corresponding to the selected vectors.
Hence, additional features of the network modelled by the graph may be encoded in the edge weights and at least partially preserved in the generated artificial graphs. For example, the weights may correspond to bandwidth of a communication link, transport capacity, etc.
In a ninth possible implementation form of the first aspect, if the nodes corresponding to the selected vectors are not connected in the first graph, said edge of the second graph is assigned a minimal weight among all edges of the first graph.
Hence, a weight structure of the original graph may be maintained while generating artificial graphs having characteristics similar to those of the original graph.
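The two weight rules can be sketched together as follows, assuming that the weights of the first graph are stored in a dictionary keyed by ordered node pairs and that `sources[k]` records which node of the first graph the node k of the second graph was selected from. These data structures and names are illustrative assumptions.

```python
def inherit_weights(new_edges, sources, weights):
    """Assign each edge (k, l) of the second graph the weight of the edge
    (sources[k], sources[l]) of the first graph, or the minimal weight of
    the first graph if that edge does not exist."""
    min_weight = min(weights.values())
    return {(k, l): weights.get((sources[k], sources[l]), min_weight)
            for k, l in new_edges}

weights = {(0, 1): 10.0, (1, 0): 10.0, (0, 2): 0.1}  # first graph
sources = [0, 1, 1, 2]                               # origin of second-graph nodes 0..3
new_edges = [(0, 1), (0, 2), (1, 2), (0, 3)]         # second graph
new_weights = inherit_weights(new_edges, sources, weights)
```

In this example the edge (1, 2) of the second graph connects two copies of node 1 of the first graph, which has no self-loop, so it falls back to the minimal weight 0.1.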
In a tenth possible implementation form of the first aspect, determining edges connecting nodes of the second graph based on the similarity function further comprises comparing connection scores of pairs of nodes of the second graph with a threshold.
For example, edges between nodes of the second graph may be added, if the connection scores of the respective node pairs are above the threshold.
In an eleventh possible implementation form of the first aspect, the threshold is determined to discriminate, based on the similarity function, the top-E node pairs of the first graph with relatively higher connection scores from the rest of the node pairs, where E is the number of edges in the first graph.
According to a second aspect of the present invention, there is provided a computer-readable medium, storing instructions which, if executed by a computer, cause the computer to load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes, map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs, determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, and determine edges connecting nodes of the second graph, based on the similarity function.
For example, the computer may be provided with a storage storing the instructions and the input network data set, or the computer may retrieve the input data set via a network connection. Moreover, the computer may be caused, by executing instructions stored on the computer-readable medium, to analyze network data and generate the input network dataset. For instance, the instructions may cause the computer to request data on computing entities and data connections between the computing entities of a network and to map the computing entities to nodes of the first graph and the data connections to edges of the first graph.
In a first possible implementation form of the second aspect, the computer-readable medium further stores instructions which, if executed by the computer, cause the computer to execute a network analysis tool and analyze a network comprising the nodes and the edges of the second graph. Hence, a modified, e.g., down-scaled or up-scaled, second graph can be derived from the first graph, wherein the derived graph has properties similar to those of the first graph. Thus, a comparison between the results of an analysis of a network corresponding to the first graph and networks corresponding to derived second graphs can be used to verify that the network analysis tool derives similar patterns when analyzing networks having similar properties.
According to a third aspect of the present invention, there is provided a network analysis tool testing apparatus, comprising a processor and persistently stored instructions which, if executed by the processor, cause the processor to load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes, map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs, determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph, determine edges connecting nodes of the second graph, based on the similarity function, and store an output network dataset, the output network dataset defining the second graph.
For instance, the apparatus may implement the method according to the first aspect and the implementation forms of the first aspect and use the second graph to test the network analysis tool.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 shows a flow-chart of a process for generating an output graph from an input graph;
Fig. 2 shows exemplary input and output graphs used/generated by the process of Fig. 1;
Fig. 3 illustrates the application of the process of Fig. 1 for use in relation to a network mining tool; and
Fig. 4 shows a block diagram of a network mining tool testing apparatus.
DETAILED DESCRIPTION
Fig. 1 and Fig. 2 illustrate a process 10 for generating an output graph 12 from an input graph 14. As indicated at step 16, the process 10 starts with receiving an input network dataset defining the input graph 14. In this regard, the following notation is used in the remainder:
Figure imgf000009_0001
Figure imgf000010_0001
For instance, the input graph 14 illustrated in Fig. 2 may be a directed weighted graph G = (N, E) with a community structure, as indicated by the circles around nodes (0,1,2,3) and (3,4,5,6,7), respectively. Thus, each node $n_i$ may have an assigned community label $c_i$. However, it is to be noted that, depending on the graph, no community label or a set of community labels
Figure imgf000010_0003
may be assigned to a node. Moreover, each edge $n_i \to n_j$ may have an assigned weight $w_{ij}$.
As indicated at step 18, the input graph 14 is mapped to vectors. In particular, the graph G = (N, E) may be embedded by mapping the nodes to real-valued vectors
Figure imgf000010_0005
Figure imgf000010_0004
For example, the directed weighted graph G may be embedded based on a bilinear link model, BLM, or using large-scale information network embedding, LINE, although other techniques such as DeepWalk (cf. B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online Learning of Social Representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014) or node2vec (cf. A. Grover and J. Leskovec, "node2vec: Scalable Feature Learning for Networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016) may be used instead.
For instance, a BLM designed specifically for directed graphs has been suggested by O. U. Ivanov and S. O. Bartunov in "Learning Representations in Directed Networks," presented at the International Conference on Analysis of Images, Social Networks and Texts, and published on pages 196-207 in Volume 542 of the series "Communications in Computer and Information Science" by Springer in 2015, and is given by:
Figure imgf000010_0002
Here, the parameters are associated with input and output node representations,
Figure imgf000011_0004
hence allowing the embedding of a directed graph. Having regard to the joint link probability
Figure imgf000011_0005
the objective function of the embedding is a log-likelihood of the whole graph:
Figure imgf000011_0003
For the softmax approximation, a technique called noise contrastive estimation, NCE, presented by M. Gutmann and A. Hyvärinen in "Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models," in AISTATS, Volume 1, 2010, page 6, may be used. This technique is directed at the estimation of unnormalized probabilistic models, treating the normalizing constant as an additional parameter. The key idea is to reduce the task of probability density learning to a binary classification problem, namely, distinguishing samples of the data distribution $p_d(x)$ from samples of a noise distribution $p_n(x)$. Applied to graph embedding and assuming that noise samples are k times more frequent than data samples, the mixture distribution takes the form:
Figure imgf000011_0001
Moreover, the posterior probability that a sample x is from the data distribution is:
Figure imgf000011_0002
If a model $p_\theta(x)$ with parameter set $\theta$ aims at fitting the data distribution, the posterior probability becomes a function of $\theta$:
Figure imgf000012_0001
Logistic regression may then be used to optimize the log-likelihood of the data against the noise:
Figure imgf000012_0002
This approximates $p_d(x)$ without a normalization requirement in regard to $p_\theta(x)$. In the BLM setup, the normalization constant becomes a new parameter
Figure imgf000012_0006
resulting in a new parameter set and the following probabilistic model:
Figure imgf000012_0005
Figure imgf000012_0004
Taking into account that $p_d$ corresponds to the graph edges and choosing $p_n$ as
Figure imgf000012_0007
the new objective becomes:
Figure imgf000012_0003
Thus, the initial objective in BLM may be replaced with the NCE objective which can be efficiently optimized.
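For reference, the generic NCE relations described in the preceding paragraphs can be written as follows. This is a sketch in the notation used in the text ($p_d$, $p_n$, $p_\theta$, $k$) and follows the commonly stated NCE formulation; it is not a transcription of the referenced figures, and the symbols $p_{\mathrm{mix}}$, $D$ (an indicator that a sample stems from the data distribution) and $J$ are assumptions introduced here.

```latex
\begin{align}
p_{\mathrm{mix}}(x) &= \frac{1}{k+1}\, p_d(x) + \frac{k}{k+1}\, p_n(x), \\
P(D = 1 \mid x) &= \frac{p_d(x)}{p_d(x) + k\, p_n(x)}, \\
P(D = 1 \mid x, \theta) &= \frac{p_\theta(x)}{p_\theta(x) + k\, p_n(x)}, \\
J(\theta) &= \mathbb{E}_{x \sim p_d}\bigl[\log P(D = 1 \mid x, \theta)\bigr]
           + k\, \mathbb{E}_{x \sim p_n}\bigl[\log P(D = 0 \mid x, \theta)\bigr].
\end{align}
```

The second line follows from the first by Bayes' rule with prior probabilities $1/(k+1)$ for data samples and $k/(k+1)$ for noise samples, and the third line replaces $p_d$ by the model $p_\theta$.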
The LINE approach for embedding a directed weighted graph suggested by J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, in "LINE: Large-scale Information Network Embedding," published in the Proceedings of the 24th International Conference on World Wide Web, ACM, 2015, pages 1067-1077, is based on so-called first-order and second-order proximities between nodes. The first-order proximity of the nodes $n_i$ and $n_j$ is indicated by the edge weight $w_{ij}$ and characterizes the strength of the relation of the nodes:
Figure imgf000013_0001
The second-order proximity characterizes the relationship of a node $n_j$ with its context $n_i$:
Figure imgf000013_0002
This in fact coincides with the softmax approximation in the BLM.
Furthermore, it is suggested to optimize the two corresponding objectives separately:
Figure imgf000013_0003
The embedding vectors $u_i$ of both models may then be concatenated.
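The two proximities and the final concatenation can be sketched as follows. The sketch assumes the forms commonly given for LINE, i.e., a sigmoid of a dot product for the first-order model and a softmax over context vectors for the second-order model; it is not transcribed from the referenced figures, no training is performed, and all vectors and names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def first_order(u_i, u_j):
    """First-order model: sigmoid of the dot product of two node vectors."""
    return 1.0 / (1.0 + math.exp(-dot(u_i, u_j)))

def second_order(i, node_vecs, context_vecs):
    """Second-order model: softmax over all context vectors, conditioned on node i."""
    scores = [dot(node_vecs[i], c) for c in context_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def concatenate(first_vecs, second_vecs):
    """Join, per node, the vector of the first-order model with that of the second-order model."""
    return [a + b for a, b in zip(first_vecs, second_vecs)]

# Hypothetical vectors, standing in for separately optimized models.
u_first = [[0.5, 0.1], [0.2, 0.4], [-0.3, 0.8]]
u_second = [[0.1, 0.7], [0.6, 0.2], [0.0, 0.3]]
ctx_second = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]

p1 = first_order(u_first[0], u_first[1])
p2 = second_order(0, u_second, ctx_second)
combined = concatenate(u_first, u_second)
```

The softmax denominator sums over all nodes, which is the summation cost that the sampling-based approximations discussed in the text are meant to avoid.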
In order to reduce the summation complexity of the denominator, negative sampling, NEG, a technique known from language modelling and described by T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, in "Distributed Representations of Words and Phrases and their Compositionality," published in the Proceedings of the Advances in Neural Information Processing Systems, 2013, pages 3111-3119, may be used. NEG is a simplification of NCE which does not approximate the softmax but nevertheless retains the quality of the embedding vectors. This is achieved by replacing the term $k p_n(x)$ with 1 and ignoring the normalizing constant which results in:
Figure imgf000014_0001
Substituting this into the expression for yields the NEG objective. For a single edge in
Figure imgf000014_0002
LINE this has the form:
Figure imgf000014_0003
However, it is noted that the above similarity function is just one example,
Figure imgf000014_0004
and other similarity functions may be used instead.
Continuing with the example of Fig. 2, the input graph 14 G has $|N| = 8$ nodes and $|E| = 28$ edges, wherein two edges, (0,5) and (2,6), have weight 0.1 and all other edges have weight 10. As indicated above, the overlapping communities of the input graph 14 G are a = (0,1,2,3) and b = (3,4,5,6,7) with modularity $Q_G = 0.2955$ (cf. M. Drobyshevskiy, A. Korshunov, and D. Turdakov, "Parallel Modularity Computation for Directed Weighted Graphs with Overlapping Communities," in Proceedings of the Institute for System Programming, Volume 28(6), 2016, pages 153-170). The input graph 14 G = (N, E) with N = {0,1,2,3,4,5,6,7}, E = {(0,1), (1,0), (0,2), ..., (7,6)}, weight labels $\{w_{01} = 10, w_{10} = 10, \ldots, w_{05} = 0.1, \ldots, w_{76} = 10\}$, and community labels may be embedded using
Figure imgf000014_0005
BLM with the following parameters:
• epochs=600 (number of epochs),
Figure imgf000015_0002
This may result in two vectors and a normalization parameter $Z_i$ per node:
Figure imgf000015_0003
Figure imgf000015_0001
The node pairs
Figure imgf000015_0011
may then be ranked by their similarity score. If embedding is successful, (almost) all edges
Figure imgf000015_0012
should have higher scores than non-edge pairs
Figure imgf000015_0010
and a threshold $t_G$ in $s_{ij}$ may be set which has rank $|E|$. Said threshold $t_G$ (approximately) separates edges
Figure imgf000015_0008
from non-edges
Figure imgf000015_0007
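The choice of a threshold with rank $|E|$ can be sketched as follows, assuming that the similarity scores of all ordered node pairs are held in a dictionary. The scores, the function name `rank_threshold`, and the use of a non-strict comparison (so that, absent ties, exactly the top $|E|$ pairs are retained) are illustrative assumptions.

```python
def rank_threshold(scores, num_edges):
    """Return the score of the pair ranked num_edges-th (1-based) in descending order."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[num_edges - 1]

# Hypothetical scores for all ordered pairs of a 3-node graph.
scores = {(0, 1): 0.9, (1, 0): 0.8, (0, 2): 0.7,
          (2, 0): 0.2, (1, 2): 0.1, (2, 1): 0.05}
edges = {(0, 1), (1, 0), (0, 2)}  # edges of the input graph

t_G = rank_threshold(scores, len(edges))
predicted = {pair for pair, s in scores.items() if s >= t_G}
```

If the embedding is successful, as in this toy example, the pairs at or above the threshold coincide with the edges of the input graph.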
Then, the vectors
Figure imgf000015_0009
may be concatenated into a corresponding embedding vector for each node
Figure imgf000015_0013
Figure imgf000015_0014
Continuing with the above example, ranking all node pairs
Figure imgf000015_0006
N using the BLM similarity score computed as ) may result in the following ordered list:
Figure imgf000015_0005
Figure imgf000015_0004
Figure imgf000016_0001
The threshold $t_G$ in $s_{ij}$ having rank $|E|$ may thus be computed as $t_G = 0.639097846078$. Moreover, concatenating the vectors
Figure imgf000016_0005
into an embedding vector leads to:
Figure imgf000016_0004
Figure imgf000016_0002
As indicated at step 20, the embedding vectors may be used to determine the output graph
Figure imgf000016_0006
12. For example, in order to approximate the distribution of the embedding vectors when
Figure imgf000016_0007
sampling/drawing the vectors representing the output graph 12:
• the vectors
Figure imgf000016_0003
may be randomly sampled,
• (Gaussian) noise may be added to the vectors and the resulting vectors may be
Figure imgf000016_0009
randomly sampled, or
• a multi-dimensional probability distribution may be fitted to the vectors and the
Figure imgf000016_0008
vectors representing the output graph 12 may be randomly drawn from said distribution.
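The third option can be sketched as follows. As a simplifying assumption, a Gaussian with independent dimensions (i.e., a diagonal covariance) is fitted here; a full covariance matrix or a mixture model may capture the embedding distribution more faithfully. The function names and sample vectors are hypothetical.

```python
import random
import statistics

def fit_diagonal_gaussian(vectors):
    """Estimate a mean and a (population) standard deviation per dimension."""
    dims = list(zip(*vectors))
    means = [statistics.mean(d) for d in dims]
    stdevs = [statistics.pstdev(d) for d in dims]
    return means, stdevs

def draw(means, stdevs, m, rng):
    """Draw m vectors from the fitted per-dimension Gaussians."""
    return [[rng.gauss(mu, sd) for mu, sd in zip(means, stdevs)]
            for _ in range(m)]

rng = random.Random(1)
embedding_vectors = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
means, stdevs = fit_diagonal_gaussian(embedding_vectors)
output_vectors = draw(means, stdevs, m=10, rng=rng)
```

Because the output vectors are drawn from the fitted distribution rather than copied, their number is independent of the number of input vectors.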
As the number of vectors representing the output graph 12 corresponds to the size $m = |M|$ of the output graph 12 H = (M, F), 16 vectors may be sampled/drawn in relation to the example of Fig. 2. For example, m vectors may be randomly picked (with repetitions) from the set of
Figure imgf000017_0001
embedding vectors
Figure imgf000017_0009
E.g., for each $j = 1..m$, $i$ may be drawn uniformly from $[1, |N|]$ and an assignment $q_j = r_i$ may be made. As the vectors $q_j$ correspond to the nodes of the
Figure imgf000017_0008
output graph the output graph 12 (with edges yet to be
Figure imgf000017_0007
determined) may be defined as To determine the edges, each
Figure imgf000017_0006
vector qi may then be de-concatenated into 2 vectors of equal length
Figure imgf000017_0005
Continuing with the above example, | M |= 16 vectors may be selected from the set of
Figure imgf000017_0002
vectors
Figure imgf000017_0004
Figure imgf000017_0003
The selected vectors may then be assigned to the nodes of the output graph 12 H :
Figure imgf000017_0011
Figure imgf000017_0010
with M = {mx ,m2,...,mi 5} . For each node
Figure imgf000017_0013
the vector may be de-concatenated into
Figure imgf000017_0014
2 vectors of equal length
Figure imgf000017_0012
Figure imgf000018_0001
In this regard, it should however be noted that instead of concatenating and de-concatenating vectors, the selecting and assigning may also be performed on the basis of vector sets, where each set contains a first vector in relation to outgoing edges and a second vector in relation to incoming edges.
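The concatenate, select, and de-concatenate sequence can be sketched as follows. The names `r` and `q` mirror the symbols used for the embedding vectors of the input graph and the selected vectors of the output graph; zero-based indices, the helper functions, and the sample values are assumptions made for illustration.

```python
import random

def concatenate(out_vec, in_vec):
    """Join the outgoing-edge and incoming-edge vectors of one node."""
    return list(out_vec) + list(in_vec)

def deconcatenate(q):
    """Split a concatenated vector into 2 vectors of equal length."""
    half = len(q) // 2
    return q[:half], q[half:]

rng = random.Random(42)
out_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical outgoing-edge vectors
in_vecs = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]   # hypothetical incoming-edge vectors
r = [concatenate(o, i) for o, i in zip(out_vecs, in_vecs)]

m = 6  # up-scale from 3 to 6 nodes
sources = [rng.randrange(len(r)) for _ in range(m)]  # uniform, with repetitions
q = [list(r[i]) for i in sources]
pairs = [deconcatenate(v) for v in q]
```

Selecting index pairs directly, without the intermediate concatenation, yields the same result, which corresponds to the vector-set alternative mentioned above.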
As indicated at step 22, edges between nodes of the output graph 12 H = (M, ·) may then be determined based on the similarity function by connecting the top $|F|$ pairs of nodes ranked by the similarity function. For instance, the similarity score may be computed for all pairs
Figure imgf000018_0005
as softmax. For pairs $(m_i, m_j) \in M \times M$ with score $z_{ij} > t_G$,
Figure imgf000018_0002
node $m_i$ may be connected to node $m_j$ with an edge. The output graph 12 H = (M, F) may thus have a set of directed edges
Figure imgf000018_0003
Continuing with the above example, the similarity scores $z_{ij}$ may be:
Figure imgf000018_0004
For pairs
Figure imgf000019_0003
with score $z_{ij} > t_G$, node $m_i$ may be connected to node $m_j$ with:
Figure imgf000019_0002
Figure imgf000019_0001
As a result, the edges F = {(0,2), (0,4), ..., (15,14)} between the nodes $m_i$ and $m_j$ of the output graph 12 may be determined.
Moreover, weights may be assigned to each of the $|F|$ edges of the output graph 12 H by inheriting the edge weights of the corresponding edges of the input graph 14 G, if the input graph 14 G has different edge weights. Furthermore, for edges of the output graph 12 H which have no corresponding edges in the input graph 14 G, a minimal edge weight may be assigned, e.g., a minimal weight of all edges of the input graph 14 G. I.e., for each edge $(k,l) \in F$, the corresponding edge weight may be determined by if the
Figure imgf000019_0010
Figure imgf000019_0005
corresponding node vectors are sampled from are connected by an
Figure imgf000019_0009
Figure imgf000019_0006
edge a minimal weight may be assigned.
Figure imgf000019_0008
Figure imgf000019_0007
Continuing with the above example:
Figure imgf000019_0004
Moreover, community labels may be assigned to each of the $|M|$ nodes of graph H by inheriting corresponding community labels of the input graph 14 G, if the input graph 14 G has a community structure, i.e., community labels $c_i$. I.e., for each node $m_j \in M$, its community label set $C'_j$ may be determined by $C'_j = C_i$, if the corresponding node vector $q_j$ is sampled from node vector $r_i$.
Continuing with the above example:
Figure imgf000020_0001
As a result, the output graph 12 H = (M, F) with $|M| = 16$ nodes and $|F| = 99$ edges and two overlapping communities a = (0,5,6,14,15) and b = (1,2,3,4,7,8,9,10,11,12,13,14) is determined, wherein the edges within the communities have high weight while the edges between the communities have low weight. Moreover, the output graph 12 H = (M, F) has a degree distribution and a distribution of subgraphs with 3 nodes similar to those of the input graph 14 G = (N, E) and a relatively high modularity ($Q_H = 0.2374$) in view of the communities.
In summary, the above process 10 of generating random output graphs 12 which have properties similar to those of a given input graph 14 provides the following benefits:
• automatic learning of degree distribution, subgraph distribution, and community structure from a given graph and reproducing them in synthetic graphs,
• enabling synthetic graphs of arbitrary size, and
• supporting directed weighted graphs with communities.
However, it is clear to the skilled person that the above process 10 is not limited to directed weighted graphs with communities but that the process 10 may also be applied to graphs which are not directed, graphs which have edges without weight, and/or graphs without community structure. Moreover, the (directed) (weighted) input graph 14 (with communities) may - in principle - be from any graph domain (social, mobile, biological, etc.).
As shown in Fig. 3, the output graph 12 may be used for the development and significance testing of network mining tools, e.g., in view of community detection. Furthermore, since the output graph 12 can be made arbitrarily large, the scalability of network mining tools can be evaluated by testing a network mining tool with multiple output graphs 12 of different size which are all generated from the same input graph 14 but differ in size. For instance, the network mining tool may be tested with output graphs 12 having half, one-fifth of, one-tenth of, one-hundredth of, etc., and/or two times, five times, ten times, one-hundred times, etc., the number of nodes of the input graph 14. The analysis results gained by analyzing such output graphs 12 may be compared and if the results are consistent with each other (for a sufficiently large number of output graphs 12), scalability of the network mining tool may be verified.
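Such a scalability test can be sketched as follows. The sketch assumes one outgoing-edge and one incoming-edge vector per node, a dot product as similarity score, and node selection with repetition; `edge_density` is a hypothetical stand-in for the network mining tool under test, and all names, vectors, and the threshold value are illustrative.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate_graph(out_vecs, in_vecs, m, threshold, rng):
    """Select m nodes with repetition, then connect ordered pairs scoring above the threshold."""
    src = [rng.randrange(len(out_vecs)) for _ in range(m)]
    edges = [(a, b) for a in range(m) for b in range(m)
             if a != b and dot(out_vecs[src[a]], in_vecs[src[b]]) > threshold]
    return m, edges

def edge_density(num_nodes, edges):
    """Hypothetical stand-in for the tool under test: fraction of possible directed edges present."""
    return len(edges) / (num_nodes * (num_nodes - 1))

rng = random.Random(7)
out_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
in_vecs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]

results = {}
for factor in (0.5, 2, 5, 10):  # output sizes relative to the input graph
    m = int(factor * len(out_vecs))
    results[factor] = edge_density(*generate_graph(out_vecs, in_vecs, m, 0.5, rng))
```

Comparing the entries of `results` across the size factors, ideally over many generated graphs per size, indicates whether the analysis result remains consistent as the graph grows or shrinks.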
Moreover, data anonymization can be achieved, which makes it possible to publish network features without making the exact structure of the network public. Finally, if a large network is difficult to analyze due to its size, the process 10 may be applied to create a representative sample of such a network, i.e., an output graph 12 of smaller size with similar properties.
Fig. 4 shows a block diagram of a network mining tool testing apparatus 24. The apparatus 24 comprises a processor 26 and a computer-readable medium 28 persistently storing instructions which, if executed by the processor 26, implement some or all steps of the above-described process 10.

Claims

1. A method for testing a functionality of a network analysis tool, the method comprising: receiving an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes; mapping the nodes to a first set of vectors, wherein the mapping determines a similarity function assigning connection scores to vector pairs; determining, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph; and determining edges connecting nodes of the second graph, based on the similarity function.
2. The method of claim 1, further comprising: using the network analysis tool to analyze a network comprising the nodes and the edges of the second graph.
3. The method of claim 1 or 2, wherein the first graph is a directed graph and determining edges connecting nodes of the second graph comprises determining, for each ordered node pair of the second graph: whether an edge connects a first node of the node pair to a second node of the node pair based on a first connection score; and whether the edge connects the second node of the node pair to the first node of the node pair based on a second connection score.
4. The method of any one of claims 1 to 3, wherein all vectors of the second set of vectors are determined based on randomly or pseudo-randomly drawing vectors from a multidimensional probability distribution approximated from the first set of vectors.
5. The method of claim 4, wherein all vectors of the second set of vectors are determined by selecting vectors from the first set of vectors.
6. The method of claim 4, wherein all vectors of the second set of vectors are determined by selecting vectors from the first set of vectors and adding a noise vector to the selected vector.
7. The method of claim 6, wherein the noise vector is randomly or pseudo-randomly drawn from a multidimensional Gaussian probability distribution.
8. The method of any one of claims 4 to 7, wherein the nodes of the first graph are assigned to communities and a node of the second graph corresponding to a selected vector of the first set inherits a respective community assignment of the node corresponding to the selected vector.
9. The method of any one of claims 4 to 8, wherein the edges of the first graph are assigned weights and an edge of the second graph connecting nodes corresponding to selected vectors of the first set inherits a respective weight of an edge of the first graph connecting the nodes corresponding to the selected vectors.
10. The method of claim 9, wherein if the nodes corresponding to the selected vectors are not connected in the first graph, said edge of the second graph is assigned a minimal weight among all edges of the first graph.
11. The method of any one of claims 1 to 10, wherein determining edges connecting nodes of the second graph based on the similarity function further comprises comparing connection scores of pairs of nodes of the second graph with a threshold.
12. The method of claim 11, wherein the threshold is determined to discriminate, based on the similarity function, top-E node pairs of the first graph with relatively higher connection scores from the rest of node pairs, where E is a number of the edges in the first graph.
13. A computer-readable medium, storing instructions which if executed by a computer cause the computer to: load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes; map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs; determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph; and determine edges connecting nodes of the second graph, based on the similarity function.
14. The computer-readable medium of claim 13, further storing instructions which if executed by the computer cause the computer to: execute a network analysis tool; and analyze a network comprising the nodes and the edges of the second graph.
15. A network analysis tool testing apparatus, comprising: a processor; and persistently stored instructions which if executed by the processor cause the processor to: load an input network dataset, the input network dataset defining a first graph, the first graph comprising nodes and edges, the edges representing connections between the nodes; map the nodes to a first set of vectors, wherein the mapping is based on a similarity function assigning connection scores to vector pairs; determine, based on the first set of vectors, a second set of vectors, wherein each vector of the second set of vectors represents a node of a second graph; determine edges connecting nodes of the second graph, based on the similarity function; and store an output network dataset, the output network dataset defining the second graph.
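For illustration only, the steps recited in claims 3, 6, 7 and 9 to 12 can be sketched as follows. This is a hedged sketch under stated assumptions, not the claimed implementation: it presumes the node vectors of the first graph have already been learned, it uses a hypothetical similarity function (logistic of the dot product between the first half of one vector and the second half of the other, so that the two directions of a node pair can score differently), and all function names are invented for this example. Scoring every ordered pair is quadratic in the number of nodes and is shown only for clarity.

```python
import math
import random


def score(u, v):
    """Hypothetical connection score for the ordered pair (u -> v).

    Assumption: the first half of a vector describes a node as a source,
    the second half describes it as a target.
    """
    h = len(u) // 2
    dot = sum(a * b for a, b in zip(u[:h], v[h:]))
    return 1.0 / (1.0 + math.exp(-dot))


def top_e_threshold(vectors, num_edges):
    """Score of the E-th best ordered node pair of the first graph (claim 12)."""
    n = len(vectors)
    scores = sorted(
        (score(vectors[i], vectors[j])
         for i in range(n) for j in range(n) if i != j),
        reverse=True,
    )
    return scores[num_edges - 1]


def sample_nodes(vectors, count, sigma, rng):
    """Select vectors of the first set and add Gaussian noise (claims 6, 7).

    Returns (new_vectors, sources); sources[k] is the index of the
    first-graph node from which new node k was derived.
    """
    sources = [rng.randrange(len(vectors)) for _ in range(count)]
    new_vectors = [[x + rng.gauss(0.0, sigma) for x in vectors[s]]
                   for s in sources]
    return new_vectors, sources


def build_edges(vectors, threshold):
    """Directed edges: each direction of a pair is scored separately (claim 3)."""
    n = len(vectors)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and score(vectors[i], vectors[j]) >= threshold]


def inherit_weight(edge, sources, weights):
    """Weight of the first-graph edge between the source nodes, or the
    minimal first-graph weight if they are not connected (claims 9, 10)."""
    i, j = edge
    return weights.get((sources[i], sources[j]), min(weights.values()))
```

A second graph of arbitrary size could then be produced by calling `sample_nodes` with the desired node count, `build_edges` with the threshold obtained from `top_e_threshold` on the first graph, and `inherit_weight` for each resulting edge; a community assignment could be inherited through the same `sources` list.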

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/RU2017/000085 WO2018151619A1 (en) 2017-02-20 2017-02-20 Network analysis tool testing
CN201780086994.5A CN110313150B (en) 2017-02-20 2017-02-20 Network analysis tool testing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2017/000085 WO2018151619A1 (en) 2017-02-20 2017-02-20 Network analysis tool testing

Publications (1)

Publication Number Publication Date
WO2018151619A1 true WO2018151619A1 (en) 2018-08-23

Family

ID=58699234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2017/000085 WO2018151619A1 (en) 2017-02-20 2017-02-20 Network analysis tool testing

Country Status (2)

Country Link
CN (1) CN110313150B (en)
WO (1) WO2018151619A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7821966B2 (en) * 2007-03-19 2010-10-26 International Business Machines Corporation Method and apparatus for network topology discovery using closure approach
CN101877711B (en) * 2009-04-28 2013-08-28 华为技术有限公司 Social network establishment method and device, and community discovery method and device
CN101894123A (en) * 2010-05-11 2010-11-24 清华大学 Subgraph based link similarity quick approximate calculation system and method thereof
CN103034687B (en) * 2012-11-29 2017-03-08 中国科学院自动化研究所 A kind of relating module recognition methodss based on 2 class heterogeneous networks
CN104102745B (en) * 2014-07-31 2017-12-29 上海交通大学 Complex network community method for digging based on Local Minimum side

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
A. GROVER; J. LESKOVEC: "Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining", 2016, ACM, article "Node2vec: Scalable feature learning for networks"
ADITYA GROVER ET AL: "node2vec: Scalable Feature Learning for Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 3 July 2016 (2016-07-03), XP080711771 *
AVIN CHEN ET AL: "On social networks of program committees", SOCIAL NETWORK ANALYSIS AND MINING, SPRINGER VIENNA, VIENNA, vol. 6, no. 1, 8 April 2016 (2016-04-08), pages 1 - 20, XP036119936, ISSN: 1869-5450, [retrieved on 20160408], DOI: 10.1007/S13278-016-0328-Y *
B. PEROZZI; R. AL-RFOU; S. SKIENA: "Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining", 2014, ACM, article "Deepwalk: Online learning of Social Representations"
BRYAN PEROZZI ET AL: "DeepWalk", KNOWLEDGE DISCOVERY AND DATA MINING, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 24 August 2014 (2014-08-24), pages 701 - 710, XP058053805, ISBN: 978-1-4503-2956-9, DOI: 10.1145/2623330.2623732 *
J. TANG; M. QU; M. WANG; M. ZHANG; J. YAN; Q. MEI: "Proceedings of the 24th International Conference on World Wide Web", 2015, ACM, article "LINE: Large-scale Information Network Embedding", pages: 1067 - 1077
JIAN WU ET AL: "Internet routing resilience to failures", CONEXT 2007, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 10 December 2007 (2007-12-10), pages 1 - 12, XP058272722, ISBN: 978-1-59593-770-4, DOI: 10.1145/1364654.1364687 *
KARASUYAMA MASAYUKI ET AL: "Adaptive edge weighting for graph-based learning algorithms", MACHINE LEARNING, KLUWER ACADEMIC PUBLISHERS, BOSTON, US, vol. 106, no. 2, 18 November 2016 (2016-11-18), pages 307 - 335, XP036133778, ISSN: 0885-6125, [retrieved on 20161118], DOI: 10.1007/S10994-016-5607-3 *
M. DROBYSHEVSKIY; A. KORSHUNOV; D. TURDAKOV: "Parallel Modularity Computation for Directed Weighted Graphs with Overlapping Communities", PROCEEDINGS OF THE INSTITUTE FOR SYSTEM PROGRAMMING, vol. 28, no. 6, 2016, pages 153 - 170
M. GUTMANN; A. HYVARINEN: "Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models", AISTATS, vol. 1, 2010, pages 6
O. U. IVANOV; S. O. BARTUNOV: "International Conference on Analysis of Images, Social Networks and Texts", vol. 542, 2015, SPRINGER, article "Learning Representations in Directed Networks", pages: 196 - 207
T. MIKOLOV; I. SUTSKEVER; K. CHEN; G. S. CORRADO; J. DEAN: "Distributed Representations of Words and Phrases and their Compositionality", PROCEEDINGS OF THE ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2013, pages 3111 - 3119

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390461A1 (en) * 2017-06-30 2021-12-16 Visa International Service Association Graph model build and scoring engine
US11847540B2 (en) * 2017-06-30 2023-12-19 Visa International Service Association Graph model build and scoring engine
CN110969685A (en) * 2018-09-28 2020-04-07 苹果公司 Customizable rendering pipeline using rendering maps
CN110969685B (en) * 2018-09-28 2024-03-12 苹果公司 Customizable rendering pipeline using rendering graphs

Also Published As

Publication number Publication date
CN110313150B (en) 2021-02-05
CN110313150A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
Dias et al. Concept lattices reduction: Definition, analysis and classification
Jensen et al. Linkage and autocorrelation cause feature selection bias in relational learning
Parker et al. Accelerating fuzzy-c means using an estimated subsample size
O'Neill et al. Common subtrees in related problems: A novel transfer learning approach for genetic programming
Ivanov et al. Understanding isomorphism bias in graph data sets
Nunes et al. GraphHD: Efficient graph classification using hyperdimensional computing
Yang et al. Metamorphic exploration of an unsupervised clustering program
Pelikan et al. Transfer learning, soft distance-based bias, and the hierarchical boa
WO2018151619A1 (en) Network analysis tool testing
Brown et al. LSHPlace: fast phylogenetic placement using locality-sensitive hashing
JP2010272004A (en) Discriminating apparatus, discrimination method, and computer program
Terziev Feature Generation using Ontologies during Induction of Decision Trees on Linked Data.
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN113392086B (en) Medical database construction method, device and equipment based on Internet of things
Saha et al. FLIP: active learning for relational network classification
Krész et al. Economic network analysis based on infection models
Carmona et al. Non-dominated multi-objective evolutionary algorithm based on fuzzy rules extraction for subgroup discovery
Kaedi et al. Holographic memory-based Bayesian optimization algorithm (HM-BOA) in dynamic environments
Khoshgoftaar et al. A novel feature selection technique for highly imbalanced data
Guerreiro et al. Recovering network topology and dynamics via sequence characterization
CN116049700B (en) Multi-mode-based operation and inspection team portrait generation method and device
Miao et al. Informative core identification in complex networks
Santana et al. Network measures for re-using problem information in EDAs
Wang et al. Ensemble clustering based on evidence theory
Ramos-Jiménez et al. Induction of decision trees using an internal control of induction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17722900

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17722900

Country of ref document: EP

Kind code of ref document: A1